Implementation of K-Means Clustering Algorithm for Grouping Traffic Violation Levels in Siak

Traffic offences occur in many regions and range from mild through moderate to severe. Mild offences include not carrying a Driver's Licence, an invalid STNK (Vehicle Number Certificate) or STCK (Vehicle Trial Certificate), not wearing a seat belt, not turning on headlights during the day and under certain conditions, disobeying traffic signs and disobeying traffic signals. Moderate offences include not having a Driver's Licence, not concentrating while driving and breaking the door of the drawbar. Serious violations include deviating from other vehicles on the road, damaging and interfering with road functions, not insuring one's own responsibility and not insuring staff and passengers. In this study, the K-Means algorithm was used to obtain information on groups of traffic violation data based on the time of the incident, so that the causes of the traffic violations that occurred in Siak Regency can be identified. Validation with the Davies-Bouldin Index identified 4 clusters that group the data well. The PerformanceVector assessment of the clusters gave a value of 0.134 for 4 clusters. In Cluster 1 (74 violations), most violations occurred at night; in Cluster 2 (16 violations), during the day; in Cluster 3 (6 violations), in the afternoon; and in Cluster 4 (113 violations), in the morning.


Introduction
The number of vehicles is increasing rapidly. According to data from the Central Statistics Agency (BPS), the stock of all vehicles in Indonesia exceeded 133 million units in 2019. Broken down by type, this includes 15,592,419 passenger cars, 231,569 buses, 5,021,888 freight cars and 112,771,136 motorbikes. The cumulative increase in traffic, both two-wheeled and four-wheeled, leads to increasingly congested and uncontrollable traffic and indirectly raises the risk of traffic problems, because the growth in traffic is not matched by road widening. From the perspective of social psychology, congestion makes drivers look for shortcuts or the fastest roads, which leads to traffic violations when no one is watching. Traffic violations have become ingrained in the community, including in the Siak Regency area. Every day, road users who do not obey traffic rules add to the number of road accidents and traffic violations in Siak Regency, which indicates that people do not understand road order.
K-Means clustering has been applied to accident data analysis by Iswari (2015), Fajar (2015) and Rahmat et al. (2017). Iswari (2015) used the K-Means algorithm to map accident-prone areas in Sleman, Yogyakarta Special Region. Rahmat et al. (2017) used K-Means clustering to analyse the frequency of accidents at each location with the potential for accidents in Kendari city. Both Iswari and Rahmat et al. classified road accident data into clusters based on accident-prone areas. Fajar (2015) also used K-Means to classify accident data in Semarang into several clusters of accident rate categories based on the age of the accident victims.
Based on these problems, this study groups the traffic violation data with the K-Means clustering algorithm. The expected result is information about clusters of traffic violation data based on the time of the incident, so that the causes of the violations that occurred in Siak Regency can be identified.

Research Methodology
The researchers used a clustering method with the K-Means algorithm to analyse data on traffic violations in the Siak Regency area, so that the research could be conducted in a more targeted manner. The methodology of the research is as follows. The problem identified in this study is the analysis of the incidence of traffic violations in the Siak Regency area using the K-Means method.

Data Collection
The data collected was sourced from the Siak District Attorney's Office in 2019.

Data Processing
Data processing in this study applied a clustering method with the K-Means algorithm to traffic violation data in the Siak Regency area, so that the results can be useful for evaluation and help minimise the incidence of traffic violations.

Evaluation of Research Results
At this step, the author tested the results of the study using RapidMiner software connected to the database to be tested, so that the information generated by the data mining process can be displayed in a form that is easy to understand.

Conclusion of the Research Results
At this step, conclusions are drawn from the completed research on the existing problems so that they can be used by the authorities.
The method used in this study was clustering with the K-Means algorithm. Clustering is one of the analytical techniques in data mining that groups data based on the similarity of their features.
K-Means, in turn, is a non-hierarchical cluster analysis method that attempts to divide existing objects into one or more clusters, or groups of objects, based on their characteristics by determining the closest centre point of each object. Figure 2 shows a K-Means flowchart.
where: xi = criteria data; the centroid is that of cluster j.
3. Group the data into the clusters with the smallest distances.
4. Update the value of each centre point by taking the average of the cluster.
5. Repeat steps 1, 2, 3 until the members of all clusters no longer change.
6. Once the fifth step is fulfilled, the last value of each cluster centre is used as the parameter to determine the accuracy of the data.
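The assignment, update and repeat steps listed above can be illustrated with a minimal Python sketch. It is not the RapidMiner process used in the study: the function name `k_means`, the `max_iter` safeguard and the four sample points are assumptions made for illustration, and the distance is the Euclidean distance that the Result section names.

```python
import math


def k_means(data, initial_centroids, max_iter=100):
    """Plain K-Means sketch: assign to nearest centroid, average, repeat."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):  # max_iter is an assumed safety limit
        # Group each record with the centroid at the smallest Euclidean distance.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))
            for point in data
        ]
        # Stop when cluster membership no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Update each centre point to the average of its cluster members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical two-attribute records, not taken from the study's dataset.
sample = [(1, 1), (1, 2), (8, 8), (9, 8)]
labels, centroids = k_means(sample, initial_centroids=[sample[0], sample[2]])
print(labels, centroids)
```

Passing the initial centroids explicitly mirrors the way the study picks specific records as starting centres rather than leaving the choice to the function.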

Result
This step ensures that the selected data on victims of violations are suitable for processing. The original data obtained by the author amounted to 445 traffic violation records with a total of 9 attributes.

Data Transformation
Nominal data such as offence location, time, age, sex, condition of the victim, occupation and vehicle involved must first be initialised as numerical values. This initialisation can be done by numbering the values in order of their frequency.
The first step of the K-Means algorithm is to determine the number of clusters. In this study there are 4 clusters, following the time groups for traffic violations: morning, afternoon, evening and night. The initial clusters were determined randomly using the attributes of offence location, time, age, gender, victim's circumstances, occupation and type of vehicle involved.
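The frequency-based initialisation of nominal attributes described above could be sketched as follows. This is only one possible reading of that step: assigning code 1 to the most frequent value, breaking ties alphabetically, the function name `encode_by_frequency` and the sample values are all assumptions, since the study does not state the exact numbering rule.

```python
from collections import Counter


def encode_by_frequency(values):
    """Map each nominal value to an integer rank by descending frequency."""
    counts = Counter(values)
    # Most frequent value first; ties are broken alphabetically (assumption).
    ranked = sorted(counts, key=lambda v: (-counts[v], v))
    mapping = {value: rank for rank, value in enumerate(ranked, start=1)}
    return [mapping[v] for v in values], mapping


# Hypothetical "time" attribute values, not taken from the study's dataset.
times = ["morning", "night", "morning", "afternoon", "night", "morning"]
codes, mapping = encode_by_frequency(times)
print(mapping)
print(codes)
```

The same function would be applied to each nominal attribute separately, so that every column gets its own mapping.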
Below is a dataset of traffic violations that has gone through the pre-processing steps: data cleaning, initialisation of the nominal data and, finally, transformation of all initialised data. The next step is to determine the initial cluster centres (centroids), which are chosen at random. In this study the 38th, 9th, 4th and 57th data records were chosen. After the initial centroids are determined, the distance from each record to each centroid is calculated with the Euclidean distance formula to determine which cluster the record belongs to. Here you can see the complete calculation of iteration 1. Each record is then grouped into its nearest cluster. From this grouping, new centroids are determined from the average of each cluster.

Discussion
Based on the clustering process with the K-Means algorithm in the RapidMiner application, the following information is obtained. In cluster 0, violations often occur at night; one of the causes is insufficient light intensity or illumination. In cluster 1, many violations occur during the day because it is school time. In cluster 2, many violations occur in the afternoon, which is when people return from work. In cluster 3, many violations occur in the morning, when activities begin for students, employees and others.

Conclusion
Based on the results of grouping traffic violation data with the K-Means algorithm, the following conclusions were drawn:
1. Based on validation with the Davies-Bouldin Index (DBI), a total sample of 209 data was formed into 4 clusters with an accuracy value of 0.939.
2. Grouping the traffic violation data by time of occurrence into 4 clusters produced the following information: cluster 0 contains 74 violations that occur at night, cluster 1 contains 16 violations that occur during the day, cluster 2 contains 6 violations that occur in the afternoon, and cluster 3 contains 113 violations that occur in the morning.
3. The information about the time of violation, combined with the place where the traffic violation occurred, can give the Laka Lantas unit of the Siak Regency Police knowledge for taking appropriate measures to reduce the number of traffic violations in Siak.
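For readers who want to reproduce the Davies-Bouldin validation outside RapidMiner, the standard definition of the index can be sketched in a few lines of Python. This sketch uses Euclidean distance and requires at least two clusters with distinct centroids; the function name and the sample points are assumptions, and the value it returns is not guaranteed to match the figures RapidMiner reports for the study's data. Lower values indicate more compact, better separated clusters.

```python
import math


def davies_bouldin_index(data, labels):
    """Davies-Bouldin Index with Euclidean distance (needs >= 2 clusters)."""
    clusters = {}
    for point, label in zip(data, labels):
        clusters.setdefault(label, []).append(point)
    # Centroid of each cluster: attribute-wise mean of its members.
    centroids = {
        lab: [sum(col) / len(pts) for col in zip(*pts)]
        for lab, pts in clusters.items()
    }
    # Scatter: mean distance of the members to their own centroid.
    scatter = {
        lab: sum(math.dist(p, centroids[lab]) for p in pts) / len(pts)
        for lab, pts in clusters.items()
    }
    total = 0.0
    for i in clusters:
        # Worst-case similarity between cluster i and any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Hypothetical points forming two well-separated clusters.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
dbi = davies_bouldin_index(points, [0, 0, 1, 1])
print(dbi)
```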