K-Means is an unsupervised Machine Learning algorithm used for data clustering. It partitions unlabelled data into K clusters, or groups, based on the similarity of the data points.

The number of clusters K is an input to the model. Different techniques exist to determine its optimal value; they will be discussed in future articles.

It is a very simple and easy-to-interpret algorithm, and it is best understood by walking through its training procedure.

How is a K-Means model trained?

  1. K data points are chosen at random as the initial centroids of the K clusters.
  2. The distance from each data point to each centroid is computed.
  3. Each data point is assigned to the cluster whose centroid is closest.
  4. The centroids are recalculated as the mean of the data points that belong to each cluster. The “means” part of the model name comes from this averaging step.
  5. Steps 2 to 4 are repeated until:
    • the centroids have stabilized, meaning that their values no longer change, or
    • the specified number of iterations has been reached.

Visualization of optimal clustering. K = 3
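As a rough illustration of these training steps, here is a minimal NumPy sketch of the loop (the function name kmeans_train and its parameters are only for this example, not a standard API):

```python
import numpy as np

def kmeans_train(X, k, n_iters=100, seed=0):
    """Minimal K-Means training sketch: X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct data points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: distance of every point to every centroid, shape (n_samples, k).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each point to the cluster of its closest centroid.
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (this simple sketch assumes no cluster ever becomes empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop early once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```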

How can we predict the cluster a new data point belongs to?

To get predictions from this model, we follow a process similar to the one used during training:

  1. Calculate the distance of the new data point to each of the K centroids.
  2. Assign it to the cluster whose centroid is closest.
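Reusing the sketch above, prediction is just the assignment step applied with the learned centroids (again, kmeans_predict is an illustrative name):

```python
import numpy as np

def kmeans_predict(X_new, centroids):
    """Assign each new point to the cluster of its closest centroid."""
    distances = np.linalg.norm(X_new[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)
```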

What are its advantages and disadvantages?

Advantages

  • Easy to understand and interpret.
  • It is faster than many other clustering algorithms, especially when the number of features is very large.
  • It tends to produce tighter clusters than many other clustering algorithms.

Disadvantages

  • The number K of clusters is not easy to determine. Techniques such as the elbow method can be used to estimate its optimal value.
  • The data needs to be scaled or normalized, as this model is very sensitive to the scale of the different features.
  • It may not be suitable for certain geometric shapes of the data clusters, such as elongated or non-convex clusters.
  • It is sensitive to the initialization of the centroids: a poor initialization can lead to a poor clustering.

Visualization of a poor clustering due to a poor initialization of centroids. K = 3

Centroid initialization

One way to improve the clustering is to repeat the K-Means training process several times and select the run whose clusters have the largest inter-cluster distances and the smallest intra-cluster distances.
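As a concrete sketch of this idea, one common score to minimize is the within-cluster sum of squared distances (the inertia, which is also what scikit-learn reports). The helper below reuses the kmeans_train function sketched earlier and is only illustrative:

```python
def best_of_n_runs(X, k, n_runs=10):
    """Run K-Means several times and keep the run with the smallest inertia
    (within-cluster sum of squared distances)."""
    best = None
    for seed in range(n_runs):
        centroids, labels = kmeans_train(X, k, seed=seed)
        inertia = ((X - centroids[labels]) ** 2).sum()
        if best is None or inertia < best[0]:
            best = (inertia, centroids, labels)
    return best[1], best[2]
```

scikit-learn does something very similar through the n_init parameter of its KMeans class.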

We can also use the K-Means++ initialization to place the initial centroids more carefully and improve the quality of the clustering.

The steps for initializing the centroids with the K-Means++ algorithm are the following:

  1. Pick the first centroid uniformly at random from the data points.
  2. Calculate the distance of every data point to its nearest already-chosen centroid.
  3. Choose the next centroid from among the data points with probability proportional to the squared value of that distance, so that points far from all current centroids are the most likely to be selected.
  4. Repeat steps 2 and 3 until the K centroids have been selected.
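A minimal sketch of this initialization, assuming NumPy and an illustrative function name kmeans_pp_init:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-Means++-style initialization of the K centroids."""
    rng = np.random.default_rng(seed)
    # Step 1: pick the first centroid uniformly at random.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 2: squared distance of every point to its nearest chosen centroid.
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # Step 3: sample the next centroid with probability proportional to d2.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```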

How can I use it in Python?

Finally, this model can be used in Python through the KMeans class of the scikit-learn library. It is worth mentioning that the centroid initialization is set to the K-Means++ algorithm by default.
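A typical usage pattern, combining the feature scaling mentioned earlier with KMeans on a small synthetic dataset (the dataset and parameter values are only for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic data with 3 well-separated groups, only for illustration.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Scale the features so that no single feature dominates the distances.
X_scaled = StandardScaler().fit_transform(X)

# init="k-means++" is the default; n_init controls how many restarts are run.
model = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
labels = model.fit_predict(X_scaled)

print(model.cluster_centers_)        # learned centroids
print(model.inertia_)                # within-cluster sum of squared distances
print(model.predict(X_scaled[:5]))   # cluster assignments for new points
```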

