Imagine that you are the chief campaign planner for the next presidential election. Thanks to the pandemic-driven technology adoption, campaigns are going online this time. Because it helps cover the entire nation without any traveling, your party decides to take innovative approaches.
Your task is to find groups of people with similar interests (or needs). There is going to be a separate online campaign for each of them. How would you go about it if the future of your country depended on you?
Data scientists use clustering algorithms to help with this problem. K-means is the simplest yet most effective algorithm to group large datasets using various properties. It’s an iterative approach to finding non-overlapping groups in the dataset. In your case, finding distinct voters groups who share similar needs and interests. Each individual becomes a member of one and only one group.
K is the number of clusters you prefer to have. But how do you know what the correct number is? Read through to find out.
This article will take you through the steps to solve the puzzle and answer some of the crucial decisions you must make on your way. In this article, I’ve used Python to implement the examples we discuss.
You can refer to the GitHub repository if you feel lost. The dataset I used for this illustration isn’t original. If you still want it to practice, that’s in the repository too.
Finding clusters with two variables.
Let’s suppose you have access to the age annual income (and debt) levels of individuals of your country (in a spreadsheet, like in the image below).
Of course, you could create groups based on your prior knowledge. For example, millennials with high incomes would be one of the groups. But you decide to dive deep to see if there are other ways to cluster them. Hence you decide to perform K-means with the two available variables.
Doing it using Python is effortless. You only need a few lines of code to read the data and perform K-means clustering.
You now know what the two more prominent groups are to consider and their average age and income. You also know to which group each individual belongs.
That’s incredible. But we need to see it with our own eyes to understand it better, don’t we? Again this is straightforward with a few lines of codes.
Visualizing helps. We could see the clusters are well separated, and their cluster centroids are representative.
An interesting question to ask at this point is how do we know that there are only two clusters? Could there be more? Let’s repeat the same, assuming the dataset has three groups.
Clustering with only two variables makes the visualization simple. But you could even create logical groups if there are only two features to consider. Can we add more features?
Also, we could increase the number of clusters to any value. But what is the optimal number?
We’ll answer these questions in the coming sections. But before we move there, let’s try to understand how the K-means algorithm works.
What’s happening under the hood?
K-means uses distances between data points. Thus, data points that are closer to each other are likely to fall into the same cluster.
In the beginning, K-means randomly chooses K data points as its cluster centroids. Here, K is an arbitrary number we defined. In our example, this could be age 25 and income of 20k (cluster A) and age 50 and income of 100k (cluster B).
It then calculates the distance between all the data points in our dataset and these centroids. And the algorithm assigns each data point to the closest centroid. So that means a 20-year-old individual with 30k income would be in cluster A and a 60-year-old person with 80k income may be in cluster B.
K-means calculates a new set of centroids by averaging all the data points of each cluster. Thus, the new cluster centroids of cluster A could be age 22 and 25k income. Centroids of cluster B may have changed as well. These will be the centroids for the next iteration.
This process continues until there is no significant change in the cluster centroids.
What if you want to consider more features?
Finding clusters with age and income only seems straightforward. You may not need machine learning to do this. But, In reality, you’ll have a bunch of other variables to work on. This may include their education level, health, and financial history.
The two dimensions, however, helped us understand the algorithm. Also, it was easy to visualize with only two variables. But let’s add a new variable to make it more realistic.
Say you also have the debt levels of each individual. Isn’t this a critical piece of information for political campaigns?
You can do this without much alteration to the code you’ve already written. Just add “Debt” to the list of variables you’re passing.
Finding a correct value for K.
K-means finds the clusters, but it does not determine the number of sets; you have to define it yourself.
If you’ve used a smaller K value, you might not address the correct issues of voters. Your promises may seem superficial because people with different needs are now in the same cluster.
On the other hand, if you have used a large number for K, campaigns will be costly. As a result, your party may end up pleasing every single niche that is worthless.
But how do you know the arbitrary number you came up with is the right fit?
The elbow method could help you find this.
We use the Sum of Squared distances (SSE) between data points and their cluster centroids to determine the K value in this method. We call this value inertia. With more clusters (larger K, ) inertia will always decrease as the distance to the nearest clusters decreases.
Yet, at some point, the reduction becomes insignificant. Hence, we conclude the number of clusters at this point is the correct value for K.
Here is the Python code snippet to do this.
We fit K-means with different K’s and collect the inertia values in a list. We then use this information to plot the graph below.
On this graph, it’s clear that the drop in inertia isn’t significant after three clusters.
This information means that people who fall into the same cluster have similar characteristics if you group voters into three groups. You may plot three different strategies to address them all.
Conversely, four clusters would make your campaigns expensive because each demands a different strategy. Two clusters could make the voters feel their needs aren’t met.
Drawbacks of K-means clustering.
You may have a question about the initial cluster centroids of K means. If it’s chosen randomly, wouldn’t the clusters be different each time the algorithm runs?
Absolutely, stumbling on a local minimum is a concern with the K-means algorithm.
To solve this problem, you may have to run the algorithm several times and select the one that fits well. The total sum of squared distances between cluster centroids and their members could help us find the right fit. The lower this number, the better.
The K-means algorithm is sensitive to outliers. Since the algorithm calculates new centroids at every iteration, outliers could distort it. Make sure to remove all outliers before you begin.
But, these drawbacks are more manageable. What about some serious issues that make K-means unsuitable?
K-means perform poorly for clusters with varying sizes or densities. Thus, if the dataset has more old-age people with high income than mid-age low-income earners, K means may end up with meaningless clusters.
K means to perform well for linearly separable clusters. But how about other types of data? K-means is incapable of clustering different geometric shapes such as elliptical and circles.
Nevertheless, K-means has always been the favorite for data scientists.
K-means is a simple and popular clustering algorithm. It groups data points in non-overlapping clusters.
In this article, we’ve discussed how to perform K-means clustering with Python. We started with two variables. Then we talked about adding more variables and finding a correct number for K.
Throughout this article, we’ve followed a hypothetical political campaign scenario. Although, in reality, you may have innumerable other variables, the basic steps are the same. Also, you can use the K-means algorithm in countless other applications. For example, you can use it to compress images as well as classify documents on your PC.
But that doesn’t mean it’s the best all the time. We’ve discussed several drawbacks of the K-means algorithm. For example, K-means doesn’t do well on complicated geometric shapes. Algorithms such as DBSCAN could solve it better. But that’s for another story.
Not a Medium member yet? Please use this link to become a member because I earn a commission for referring at no extra cost for you.