K-Means is an unsupervised Machine Learning algorithm used for data clustering. It is a technique employed to classify unlabelled data into a number K of clusters or groups based on their similarities.
The number of clusters K is an input to the model. There exist different techniques to determine its optimal value that will be discussed in future articles.
It is a very simple and easy-to-interpret algorithm. It can be better understood when explaining its training procedure.
How is a K-Means model trained?
- K data points are chosen randomly as centroids of the K clusters.
- The distance of each data point to each centroid is computed.
- Each data point will be assigned the cluster whose centroid is the closest.
- The centroids are recalculated based on the data points that belong to each cluster. The “means” part of the model name comes from averaging the data to find the centroids.
- Repeat steps 2 to 4 until:
- The centroids have stabilized, meaning that there is no change in their values.
- The specified number of iterations has been achieved.
Visualization of optimal clustering. K = 3
How can we predict the cluster a new data point belongs to?
In order to get the predictions from this model, we will need to follow a similar process to the one followed during training.
- Calculate the distance to each K centroid.
- Assign the cluster whose centroid is the closest.
What are its advantages and disadvantages?
- Easy to understand and interpret.
- It is faster than other clustering algorithms when the number of features is very large.
- It produces tighter clusters than other clustering algorithms.
- The number K of clusters is not easy to determine. Some techniques such as the elbow method can be used to estimate its optimal value.
- The data needs to be scaled or normalized as this model is very sensitive to the scale of the different features.
- It may not be suitable for particular geometrical shapes of the data clusters.
- It is sensitive to the initialization of the centroids. A poor initialization could lead to a poor clustering.
Visualization of a poor clustering due to a poor initialization of centroids. K = 3
One way of improving the clustering is repeating the K-Means training process several times and selecting the model that has the largest intercluster distance and the smallest intracluster distance.
Also, we can use the K-Means++ initialization to ensure a smarter initialization of the centroids and improve the quality of the clustering.
The steps for initializing the centroids with the K-Means++ algorithm are the following:
- Randomly pick the first centroid.
- Calculate the distance of all data points to the selected centroid.
- The next centroid is chosen as the data point that is the farthest from the nearest centroid.
- Repeat steps 2 and 3 until selecting the K centroids.
How can I use it in Python?
Finally, this model can be used in Python thanks to the scikit-learn library KMeans. It is worth mentioning that the centroids initialization is set by default to the K-Means++ algorithm.