KMeans Clustering With Example
Clustering is the process of grouping abstract objects into classes of similar objects, such that similarity within a cluster is high and similarity between clusters is low.
- A cluster of data objects can be treated as one group.
- While doing cluster analysis, we first partition the data set into groups based on data similarity and then assign labels to the groups.
- The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.
Cluster: a collection of similar objects.
Clustering: a technique for dividing a large set of objects into groups such that objects in the same group are highly similar to one another (or show similar behavior) and objects in different groups are dissimilar to one another. Since we have no prior knowledge of the classes of the objects, this is a form of unsupervised learning; the process of grouping objects into classes of similar objects is called clustering.
KMeans Algorithm
The algorithm is composed of the following steps:
- It randomly chooses K points from the data set as the initial centroids.
- It assigns each point to the cluster with the closest centroid.
- It recalculates each centroid as the mean of the points assigned to it.
- The assignment and update steps repeat until the centroids no longer move (a minimal implementation is sketched below).
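These steps translate almost line for line into code. Below is a minimal NumPy sketch; the function name kmeans, the fixed seed, and the use of np.allclose as the stopping test are our own choices, and the sketch assumes no cluster ever ends up empty:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means on an (n, d) array X; returns labels and centroids."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k points from the data set as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (assumes every cluster keeps at least one point).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

The worked example below walks through exactly these steps by hand.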
Example of the KMeans Algorithm
Let’s imagine we have 5 objects (say, 5 people), and for each of them we know two features: height and weight. We want to group them into k = 2 clusters.
Our dataset will look like this:
| Object | Height (H) | Weight (W) |
| --- | --- | --- |
| Person 1 | 167 | 55 |
| Person 2 | 120 | 32 |
| Person 3 | 113 | 33 |
| Person 4 | 175 | 76 |
| Person 5 | 108 | 25 |
First of all, we have to initialize the centroids for our clusters. For instance, let’s choose Person 2 and Person 3 as the two centroids c1 and c2, so that c1 = (120, 32) and c2 = (113, 33).
Now we compute the Euclidean distance between each of the two centroids and each point in the data. For a point p = (Hp, Wp) and a centroid c = (Hc, Wc), this distance is sqrt((Hp − Hc)² + (Wp − Wc)²); for example, the distance of Person 1 from c1 is sqrt((167 − 120)² + (55 − 32)²) = sqrt(2738) ≈ 52.3.
| Object | Distance from c1 | Distance from c2 |
| --- | --- | --- |
| Person 1 | 52.3 | 58.3 |
| Person 2 | 0 | 7.1 |
| Person 3 | 7.1 | 0 |
| Person 4 | 70.4 | 75.4 |
| Person 5 | 13.9 | 9.4 |
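As a sanity check, this table can be reproduced in a few lines of NumPy (the array names X, c1 and c2 are our own):

```python
import numpy as np

# Heights and weights of the five people, one row per person.
X = np.array([[167, 55], [120, 32], [113, 33], [175, 76], [108, 25]])
c1, c2 = np.array([120, 32]), np.array([113, 33])  # Person 2 and Person 3

for i, p in enumerate(X, start=1):
    d1 = np.linalg.norm(p - c1)  # Euclidean distance from c1
    d2 = np.linalg.norm(p - c2)  # Euclidean distance from c2
    print(f"Person {i}: {d1:.1f} from c1, {d2:.1f} from c2")
```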
At this point, we assign each object to the cluster whose centroid is closer (that is, we take the minimum of the two computed distances for each object).
We can then arrange the points as follows:
Person 1 → cluster 1
Person 2 → cluster 1
Person 3 → cluster 2
Person 4 → cluster 1
Person 5 → cluster 2
Let’s iterate: we redefine the centroids by computing the mean of the members of each of the two clusters.
So c’1 = ((167+120+175)/3, (55+32+76)/3) = (154, 54.3) and c’2 = ((113+108)/2, (33+25)/2) = (110.5, 29).
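The same update in code (redefining X for completeness; row indices 0, 1, 3 are Persons 1, 2 and 4, and rows 2, 4 are Persons 3 and 5):

```python
import numpy as np

X = np.array([[167, 55], [120, 32], [113, 33], [175, 76], [108, 25]])
c1_new = X[[0, 1, 3]].mean(axis=0)  # Persons 1, 2, 4 -> [154.0, 54.33]
c2_new = X[[2, 4]].mean(axis=0)     # Persons 3, 5    -> [110.5, 29.0]
```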
Then, we calculate the distances again and re-assign the points to the new centroids.
We repeat this process until the centroids don’t move anymore (or until the change in their positions falls below some small threshold).
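To run the whole procedure without hand-coding the loop, the same example can be reproduced with scikit-learn’s KMeans; passing the two starting points via the init parameter mirrors our choice of Person 2 and Person 3 (the `->` comment shows what we get on this data):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[167, 55], [120, 32], [113, 33], [175, 76], [108, 25]])
init = np.array([[120, 32], [113, 33]])  # Person 2 and Person 3 as c1, c2

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)           # -> [0 1 1 0 1]: Persons 1 & 4 vs. Persons 2, 3 & 5
print(km.cluster_centers_)  # final centroid positions after convergence
```

Persons 1 and 4 (the tallest and heaviest) end up in one cluster and the other three in the second, matching the grouping described below.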
In our case, the result we get is given in the figure below. You can see the two different clusters labeled with two different colors and the position of the centroids, given by the crosses.
[Figure: the two clusters shown in blue and green; the crosses mark the positions of the respective centroids.]
Flowchart of KMeans Clustering