Clustering Villages Based on Distance and Accessibility to Health Facilities Using the K-Means Method

There are 47 very underdeveloped and 63 underdeveloped villages in Melawi Regency. More than 50% of the villages have no health facilities, and only 20.53% of the road length in the regency is in good condition. One of the most important factors influencing health problems is the physical aspect, such as the availability of health facilities. In addition, distance and ease of access to health facilities influence how quickly people are treated and vaccinated during the Covid-19 pandemic. The objective of this study is to determine the degree of accessibility of health facilities in villages by forming village clusters, which are likely to be important to the government in ensuring treatment and the distribution of Covid-19 vaccines. The clustering method used is K-Means, with Euclidean distance to calculate the distance between data, the Elbow method to determine the optimal number of clusters, and the Silhouette coefficient to evaluate the quality of the model created with K-Means. The Elbow method showed the optimal number of clusters to be 2. Based on the results of the K-Means algorithm, cluster 1 has a larger average distance and access rated as difficult, with 92 villages, and cluster 2 has a smaller average distance and relatively easy access, with 77 villages. The silhouette coefficient of the resulting model is 0.299.


Introduction
Public health problems, especially in developing countries such as Indonesia, are influenced by two kinds of factors: physical factors, such as health facilities and disease treatment, and non-physical factors related to health problems [1]. Distance and ease of access to health facilities must also be considered, especially during the Covid-19 pandemic, because they affect how quickly people get treatment and vaccinations [2]. Melawi Regency, West Kalimantan, has 73 health facilities for 169 villages, and only 20.53% of its roads are in good condition [3]. Therefore, it is important to know the level of coverage of village health facilities by forming clusters. Several studies use clustering algorithms, such as Fuzzy C-Means [4] and K-Means [5]. The Fuzzy C-Means algorithm is reported to be fast and easy to interpret [6]; however, its calculation process and fuzzy iterations take longer than those of the K-Means algorithm [7]. The K-Means algorithm is widely applied in research because it is efficient at grouping very large amounts of data, but it is sensitive to the random selection of the initial centroids and to the choice of the initial number of clusters [8].
K-Means produces more consistent results than Fuzzy C-Means, although Fuzzy C-Means performs better when run with different numbers of iterations. Based on these problems, the researchers tested the quality of the model produced by the K-Means algorithm using the Silhouette Coefficient, applied the Elbow method to determine the best number of clusters, and used the Euclidean distance to measure the distance from each data point to the centroids.

Research Methodology
This study uses the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology [9]. CRISP-DM is a data mining standard commonly used in research and business [10]. The methodology consists of six steps: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, and Deployment.

Data Understanding
At this step, village development data was collected from the Central Statistics Agency (BPS) of Melawi Regency. The data consists of 169 records with 46 variables. The variables used are the name of the village, the name of health facilities, the number of health facilities, the distance to the nearest health facility, and the ease of access to the nearest health facility.

Data Preparation
This step builds the raw data into the final dataset used at the model creation stage. Data preprocessing is carried out in three steps:
a. Data Cleaning. Data cleaning is applied to remove noise, inconsistencies, outliers, and missing values [11]. The village development data for Melawi Regency contains missing values, so this study fills them with the average of the available values of the variable.
b. Data Transformation. The data used is converted into .csv format.
c. Data Reduction. Data reduction is applied to reduce the volume of the dataset while maintaining data integrity [12]. The method used in this study is Feature Selection [13], which selects the variables used in the modelling step. The variables used are the name of the village, the name of health facilities, the number of health facilities, the distance to the nearest health facility, and the ease of access to the nearest health facility.
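The cleaning and reduction steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the column names (`hospital_km`, `hospital_access`, `population`) and the helper functions are hypothetical and do not come from the BPS dataset, and the standard library is used instead of the study's actual tooling.

```python
from statistics import mean

def impute_with_mean(rows, columns):
    """Return copies of the rows with missing values (None) replaced by the column mean."""
    filled = [dict(row) for row in rows]  # copy so the raw data stays untouched
    for col in columns:
        observed = [row[col] for row in rows if row[col] is not None]
        col_mean = mean(observed)
        for row in filled:
            if row[col] is None:
                row[col] = col_mean
    return filled

def select_features(rows, keep):
    """Feature selection: keep only the variables needed for modelling."""
    return [{name: row[name] for name in keep} for row in rows]

# Hypothetical records; None marks a missing value.
villages = [
    {"village": "A", "hospital_km": 80.0, "hospital_access": 3, "population": 1200},
    {"village": "B", "hospital_km": None, "hospital_access": 2, "population": 950},
    {"village": "C", "hospital_km": 20.0, "hospital_access": None, "population": 400},
]

cleaned = impute_with_mean(villages, ["hospital_km", "hospital_access"])
dataset = select_features(cleaned, ["village", "hospital_km", "hospital_access"])
```

Averaging an ordinal access score can produce a non-integer value such as 2.5; whether the study rounded imputed scores is not stated, so the sketch leaves them unrounded.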

Modelling
The techniques applied at this step have specific requirements on the form of the data, so it may be necessary to return to the data preparation step. The tool used in this study is Jupyter Notebook with the Python programming language. The libraries used are Scikit-learn and Matplotlib. The steps for creating a cluster model with the K-Means algorithm are [14]:
a. Determine the optimal number of clusters k with the Elbow method [15]. The number k is determined from the sum of squared errors (SSE), calculated using equation (1) [16]: the k with the largest drop in SSE, which forms an elbow on the chart, is selected, and the initial centroid of each of the k clusters is then determined.
b. Calculate the distance from each data point to each cluster centroid using the Euclidean distance [17] with equation (2).
c. Assign each data point to the cluster whose centroid has the shortest Euclidean distance.
d. Obtain the new centroid values for the next iteration by averaging the data in each cluster.
e. Repeat steps b to d in each iteration until the centroid values no longer change.
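The assignment and update loop, together with the SSE used by the Elbow method, can be sketched in plain Python as follows. This is a minimal illustration, not the study's implementation: the six (distance in km, access score) points are hypothetical, and taking the first k records as initial centroids is an assumption made here so that the result is deterministic. In practice Scikit-learn's `KMeans` class performs the same procedure and exposes the SSE as its `inertia_` attribute.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(points, centroids):
    """Give each point the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
            for p in points]

def update(points, labels, centroids):
    """New centroid = mean of the points in the cluster (an empty cluster keeps its centroid)."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids

def kmeans(points, k, max_iter=100):
    # Assumption: the first k records serve as initial centroids (deterministic start).
    centroids = [tuple(p) for p in points[:k]]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:  # stop when the centroids no longer change
            break
        centroids = new_centroids
        labels = assign(points, centroids)
    return labels, centroids

def sse(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(euclidean(p, centroids[label]) ** 2 for p, label in zip(points, labels))

# Hypothetical (distance in km, access score) pairs.
points = [(5.0, 1.0), (80.0, 3.0), (7.0, 2.0), (90.0, 3.0), (6.0, 1.0), (85.0, 3.0)]

labels, centroids = kmeans(points, 2)

# Elbow method: compute the SSE for several values of k and look for the largest drop.
sse_by_k = {k: sse(points, *kmeans(points, k)) for k in (1, 2, 3)}
```

On this toy data the SSE falls sharply from k=1 to k=2 and only slightly from k=2 to k=3, which is the elbow pattern the method looks for.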

Evaluation
At this step the model is evaluated to ascertain whether it meets the objectives. The method used is the Silhouette Coefficient [18], which indicates how well the clusters are formed. The resulting value describes the quality of the cluster structure: a value <= 0.25 indicates no structure, > 0.25 and <= 0.50 a weak structure, > 0.50 and <= 0.70 a medium structure, and > 0.70 and <= 1 a strong structure [19].
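As an illustration of this evaluation, the sketch below computes a mean silhouette coefficient and maps it to the structure categories listed above. The four points and the function names are hypothetical; the code assumes at least two clusters and points that are not all identical. Scikit-learn's `silhouette_score` function computes the same quantity.

```python
import math

def silhouette_coefficient(points, labels):
    """Mean silhouette over all points; a point alone in its cluster scores 0."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        # a: mean distance to the other members of the point's own cluster
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q, label in zip(points, labels) if label == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def structure(value):
    """Interpretation thresholds as given in the text [19]."""
    if value <= 0.25:
        return "no structure"
    if value <= 0.50:
        return "weak structure"
    if value <= 0.70:
        return "medium structure"
    return "strong structure"

# Hypothetical, well-separated toy data.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
toy_labels = [0, 0, 1, 1]
score = silhouette_coefficient(toy_points, toy_labels)
```

Applied to the value reported in the abstract, `structure(0.299)` falls in the "weak structure" range.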

Data Preprocessing
The village development data of Melawi Regency consists of 169 records with 46 variables. The variables used are the name of the village, the name of health facilities, the number of health facilities, the distance to the nearest health facility, and the ease of access to the nearest health facility. Missing values are replaced with the average of the variable, and the variables are chosen by feature selection in Jupyter Notebook.
The optimal number of k in this study was determined with the Elbow method, because K-Means has a weakness in that the initial number of clusters is otherwise chosen arbitrarily [8]. Testing k from 1 to 10 with the Elbow method gives k=2 as the best number of clusters. The largest difference in the Sum of Squared Error (SSE) between consecutive values of k is used to select the number of clusters (Table 4.1). The SSE value (equation 1) of the resulting model with k=2 is calculated over the data of each formed cluster: the calculation is carried out up to the 169th record, and the distances from each record to its centroid are then totalled.
A cluster model with the K-Means algorithm is used to determine the centroid points. The tools used are Jupyter Notebook with Scikit-learn. The iteration process stops when the centroids no longer move or change in value. The cluster model uses 10 variables and 169 records; the variables are the distance and ease of access for hospitals, puskesmas with inpatient care, puskesmas without inpatient care, auxiliary health centers, and pharmacies. After the centroid points are determined, the distance of each record to centroid 1 and to centroid 2 is calculated using the Euclidean distance (equation 2).
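Equations (1) and (2) are cited but not reproduced in this text. Their standard forms, which are assumed here to be the ones intended, are given below; the symbols (k clusters C_j with centroids c_j, records x_i, and m variables indexed by v) are notation introduced for this sketch.

```latex
% Equation (1): sum of squared errors over the k clusters
SSE = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^{2}

% Equation (2): Euclidean distance between a record x and a centroid c over m variables
d(x, c) = \sqrt{\sum_{v=1}^{m} \left( x_v - c_v \right)^{2}}
```

With 169 records and k=2, the inner sums of equation (1) together run over all 169 records, each measured against the centroid of its own cluster.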
The resulting cluster model is analysed to identify village clusters with low and high health facility coverage based on distance and ease of access. Villages with the greater average distance to the nearest hospital are in cluster 1, at 83.42 km, with an average accessibility value of 3, which is classified as difficult. Villages in cluster 2 have a smaller average distance of 26.81 km, with an average accessibility value of 3, which is also classified as difficult.
Villages with the greater average distance to the nearest puskesmas with inpatient care are in cluster 2, at 23.19 km, with an average accessibility value of 2, which is relatively easy. The villages in cluster 1 have a smaller average distance of 21.89 km, but an average accessibility value of 3, which is classified as difficult. For the nearest puskesmas without inpatient care, villages with the greater average distance are in cluster 1, at 34.15 km, with an average accessibility value of 3, which is classified as difficult. Villages in cluster 2 have an average distance of 22.02 km and an average accessibility value of 2, which is relatively easy. Villages with the greater average distance to the nearest auxiliary health center are in cluster 1, at 19.09 km, with an average accessibility value of 2, which is relatively easy. Villages in cluster 2 have a smaller average distance of 12.96 km, also with an average accessibility value of 2, which is relatively easy. Villages with the greater average distance to the nearest pharmacy are in cluster 1, at 47. km, with an average accessibility value of 3, which is classified as difficult. Villages in cluster 2 have a smaller average distance of 24.74 km, with an average accessibility value of 3, which is also classified as difficult.
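As a quick arithmetic check, the per-facility averages reported above can be combined into overall figures per cluster. The sketch below assumes the overall value is the plain mean of the five per-facility averages, rounded to two decimals for distance and to the nearest integer for access; the source does not state how its overall figures were derived. The cluster 1 distance is left out because its pharmacy value is incomplete in the text.

```python
from statistics import mean

# Per-facility average distances (km) reported for cluster 2.
cluster2_km = {
    "hospital": 26.81,
    "puskesmas_inpatient": 23.19,
    "puskesmas_without_inpatient": 22.02,
    "auxiliary_health_center": 12.96,
    "pharmacy": 24.74,
}

# Per-facility average access scores, in the same facility order (2 = relatively easy, 3 = difficult).
access_scores = {
    1: [3, 3, 3, 2, 3],
    2: [3, 2, 2, 2, 3],
}

overall_km_cluster2 = round(mean(list(cluster2_km.values())), 2)
overall_access = {cluster: round(mean(scores)) for cluster, scores in access_scores.items()}
```

Under these assumptions the cluster 2 distance comes to 21.94 km and the access scores to 3 for cluster 1 and 2 for cluster 2, which agrees with the overall figures given in the Conclusion.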
Based on the description of the resulting pattern, the results of the analysis are that Cluster 1 has a low coverage of health facilities with 92 villages (see Table 4.5), while Cluster 2 has a high coverage of health facilities with 77 villages. It can be concluded that the 92 villages in Cluster 1 listed in Table 4.6 are villages that need more attention from the government to ensure public health in the villages of Melawi regency.

Conclusion
In this study, villages were grouped using the K-Means algorithm based on distance and ease of access to health facilities, in order to find out which village clusters have low and high coverage of health facilities. The Elbow method was used to determine the optimal number of clusters, and the quality of the resulting model was tested using the silhouette coefficient. It can be concluded that:
1. Based on the patterns of cluster 1 and cluster 2, distance does not determine the ease of access to health facilities, even after the clustering process.
2. Based on the overall average distance and access to health facilities, villages with low coverage of health facilities are in cluster 1, with a longer average distance of 41.14 km and an average access value of 3, which is classified as difficult, while villages in cluster 2 have a shorter average distance of 21.94 km and an average access value of 2, which is relatively easy.