{"id":382,"date":"2021-07-03T00:00:00","date_gmt":"2021-07-03T00:00:00","guid":{"rendered":"https:\/\/tac.debuzzify.com\/?p=382"},"modified":"2023-06-27T06:30:11","modified_gmt":"2023-06-27T06:30:11","slug":"running-election-campaigns-with-k-means-clustering","status":"publish","type":"post","link":"https:\/\/www.the-analytics.club\/running-election-campaigns-with-k-means-clustering\/","title":{"rendered":"Running Election Campaigns With K-Means Clustering."},"content":{"rendered":"\n\n\n<p>Imagine that you are the chief campaign planner for the next presidential election. Thanks to pandemic-driven technology adoption, campaigns are going online this time. Because an online campaign can cover the entire nation without any traveling, your party decides to take an innovative approach.<\/p>\n\n\n\n<p>Your task is to find groups of people with similar interests (or needs). There is going to be a separate online campaign for each of them. How would you go about it if the future of your country depended on you?<\/p>\n\n\n\n<p>Data scientists use clustering algorithms to help with this problem. K-means is one of the simplest and most widely used algorithms for grouping large datasets by their properties. It\u2019s an iterative approach to finding non-overlapping groups in the dataset. In your case, that means finding distinct voter groups who share similar needs and interests. Each individual becomes a member of one and only one group.<\/p>\n\n\n\n<p>K is the number of clusters you want to have. But how do you know what the correct number is? Read on to find out.<\/p>\n\n\n\n<p>This article will take you through the steps to solve the puzzle and help you with some of the crucial decisions you must make along the way. 
In this article, I\u2019ve used Python to implement the examples we discuss.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\">\n<p><i>You can refer to the<\/i> <a href=\"https:\/\/github.com\/ThuwarakeshM\/PracticalML-KMeans-Election\" target=\"_blank\" rel=\"noopener\"><i>GitHub repository<\/i><\/a> <i>if you feel lost. The dataset I used for this illustration isn\u2019t original. If you still want it for practice, it\u2019s in the repository too.<\/i><\/p>\n<\/blockquote>\n\n\n\n<h2 class=\"wp-block-heading\">Finding clusters with two variables.<\/h2>\n\n\n\n<p>Let\u2019s suppose you have access to the age, annual income, and debt levels of individuals in your country (in a spreadsheet, like in the image below).<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"720\" height=\"261\" src=\"https:\/\/www.the-analytics.club\/wp-content\/uploads\/2023\/06\/image-106.png\" alt=\"Finding clusters with two variables\" class=\"wp-image-983\" title=\"\" srcset=\"https:\/\/www.the-analytics.club\/wp-content\/uploads\/2023\/06\/image-106.png 720w, https:\/\/www.the-analytics.club\/wp-content\/uploads\/2023\/06\/image-106-300x109.png 300w\" sizes=\"(max-width: 720px) 100vw, 720px\" \/><\/figure><\/div>\n\n\n<p>Of course, you could create groups based on your prior knowledge. For example, millennials with high incomes would be one of the groups. But you decide to dive deeper to see if there are other ways to cluster them. So you start by running K-means on two of the available variables: age and income.<\/p>\n\n\n\n<p>Doing this in Python takes little effort. 
You only need a few lines of code to read the data and perform K-means clustering.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro padding-bottom-disabled\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:16px;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:20px\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"# Imports\nimport pandas as pd # Pandas for reading data\nfrom sklearn.cluster import KMeans # KMeans Clustering itself\n\n# Read dataset from file using pandas\ndf = pd.read_csv(&quot;.\/voters_demo_sample.csv&quot;)\n\n# perform K-Means clustering to find 2 clusters considering only age and income of voters\nkmeans = KMeans(n_clusters=2, random_state=0).fit(df[[&quot;Age&quot;, &quot;Income&quot;]])\n\n# See the cluster label for each data point \/ Group label of each voter\nkmeans.labels_\n\n# Identifying the (final) cluster centroids\nkmeans.cluster_centers_\n\n#Output\n# array([[56.05275779, 39.92805755],\n#        [30.55364807, 22.2360515 ]])\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 
24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #616E88\"># Imports<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">import<\/span><span style=\"color: #D8DEE9FF\"> pandas <\/span><span style=\"color: #81A1C1\">as<\/span><span style=\"color: #D8DEE9FF\"> pd <\/span><span style=\"color: #616E88\"># Pandas for reading data<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">from<\/span><span style=\"color: #D8DEE9FF\"> sklearn<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">cluster <\/span><span style=\"color: #81A1C1\">import<\/span><span style=\"color: #D8DEE9FF\"> KMeans <\/span><span style=\"color: #616E88\"># KMeans Clustering itself<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Read dataset from file using pandas<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">df <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> pd<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">read_csv<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">.\/voters_demo_sample.csv<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span 
class=\"line\"><span style=\"color: #616E88\"># perform K-Means clustering to find 2 clusters considering only age and income of voters<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">kmeans <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">KMeans<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">n_clusters<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">random_state<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">).<\/span><span style=\"color: #88C0D0\">fit<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">df<\/span><span style=\"color: #ECEFF4\">[[<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Age<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Income<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">]])<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># See the cluster label for each data point \/ Group label of each voter<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">kmeans<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">labels_<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Identifying the (final) cluster centroids<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">kmeans<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: 
#D8DEE9FF\">cluster_centers_<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#Output<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># array([[56.05275779, 39.92805755],<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#        [30.55364807, 22.2360515 ]])<\/span><\/span><\/code><\/pre><span style=\"display:flex;align-items:flex-end;padding:10px;width:100%;justify-content:flex-end;background-color:#2e3440ff;color:#c8d0e0;font-size:12px;line-height:1;position:relative\">Python<\/span><\/div>\n\n\n\n<p>You now know the two most prominent groups to consider, along with their average age and income. You also know which group each individual belongs to.<\/p>\n\n\n\n<p>That\u2019s incredible. But we need to see it with our own eyes to understand it better, don\u2019t we? Again, this is straightforward with a few lines of code.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro padding-bottom-disabled\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" 
style=\"font-size:16px;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:20px\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"# Imports for visualization\nimport seaborn as sns\n\n# Create scatterplot\nax = sns.scatterplot(\n    x=df.Age,\n    y=df.Income,\n    hue=kmeans.labels_,\n    palette=sns.color_palette(&quot;colorblind&quot;, n_colors=2),\n    legend=None,\n)\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #616E88\"># Imports for visualization<\/span><\/span>\n<span class=\"line\"><span style=\"color: 
#81A1C1\">import<\/span><span style=\"color: #D8DEE9FF\"> seaborn <\/span><span style=\"color: #81A1C1\">as<\/span><span style=\"color: #D8DEE9FF\"> sns<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Create scatterplot<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">ax <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> sns<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">scatterplot<\/span><span style=\"color: #ECEFF4\">(<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #D8DEE9\">x<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\">df<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">Age<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #D8DEE9\">y<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\">df<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">Income<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #D8DEE9\">hue<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\">kmeans<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">labels_<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #D8DEE9\">palette<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\">sns<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">color_palette<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">colorblind<\/span><span 
style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">n_colors<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #D8DEE9\">legend<\/span><span style=\"color: #81A1C1\">=None<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">)<\/span><\/span><\/code><\/pre><span style=\"display:flex;align-items:flex-end;padding:10px;width:100%;justify-content:flex-end;background-color:#2e3440ff;color:#c8d0e0;font-size:12px;line-height:1;position:relative\">Python<\/span><\/div>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"720\" height=\"405\" src=\"https:\/\/www.the-analytics.club\/wp-content\/uploads\/2023\/06\/image-107.png\" alt=\"K mean clustering using 2 variables\" class=\"wp-image-984\" title=\"\" srcset=\"https:\/\/www.the-analytics.club\/wp-content\/uploads\/2023\/06\/image-107.png 720w, https:\/\/www.the-analytics.club\/wp-content\/uploads\/2023\/06\/image-107-300x169.png 300w\" sizes=\"(max-width: 720px) 100vw, 720px\" \/><\/figure><\/div>\n\n\n<p>Visualizing helps. We can see that the clusters are well separated and that their centroids are representative.<\/p>\n\n\n\n<p>An interesting question to ask at this point is: how do we know that there are only two clusters? 
Could there be more? Let\u2019s repeat the same, assuming the dataset has three groups.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro padding-bottom-disabled\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:16px;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:20px\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"# Performing K-Means to find 3 clusters using the same two variables\nkmeans = KMeans(n_clusters=3, random_state=0).fit(df[[&quot;Age&quot;, &quot;Income&quot;]])\n\nax = sns.scatterplot(\n    x=df.Age,\n    y=df.Income,\n    hue=kmeans.labels_,\n    palette=sns.color_palette(&quot;colorblind&quot;, n_colors=3),\n    legend=None,\n)\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" 
stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #616E88\"># Performing K-Means to find 3 clusters using the same two variables<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">kmeans <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">KMeans<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">n_clusters<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">random_state<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">).<\/span><span style=\"color: #88C0D0\">fit<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">df<\/span><span style=\"color: #ECEFF4\">[[<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Age<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Income<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">]])<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">ax <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> sns<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">scatterplot<\/span><span style=\"color: #ECEFF4\">(<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    
<\/span><span style=\"color: #D8DEE9\">x<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\">df<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">Age<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #D8DEE9\">y<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\">df<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">Income<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #D8DEE9\">hue<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\">kmeans<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">labels_<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #D8DEE9\">palette<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\">sns<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">color_palette<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">colorblind<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">n_colors<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #D8DEE9\">legend<\/span><span style=\"color: #81A1C1\">=None<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">)<\/span><\/span><\/code><\/pre><span 
style=\"display:flex;align-items:flex-end;padding:10px;width:100%;justify-content:flex-end;background-color:#2e3440ff;color:#c8d0e0;font-size:12px;line-height:1;position:relative\">Python<\/span><\/div>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"720\" height=\"405\" src=\"https:\/\/www.the-analytics.club\/wp-content\/uploads\/2023\/06\/image-108.png\" alt=\"K mean clustering using 2 variables - finding 3 clusters\" class=\"wp-image-985\" title=\"\" srcset=\"https:\/\/www.the-analytics.club\/wp-content\/uploads\/2023\/06\/image-108.png 720w, https:\/\/www.the-analytics.club\/wp-content\/uploads\/2023\/06\/image-108-300x169.png 300w\" sizes=\"(max-width: 720px) 100vw, 720px\" \/><\/figure><\/div>\n\n\n<p>Clustering with only two variables makes the visualization simple. But with only two features to consider, you could create logical groups by hand anyway. Can we add more features?<\/p>\n\n\n\n<p>Also, we could increase the number of clusters to any value. But what is the optimal number?<\/p>\n\n\n\n<p>We\u2019ll answer these questions in the coming sections. But before we get there, let\u2019s try to understand how the K-means algorithm works.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What\u2019s happening under the hood?<\/h2>\n\n\n\n<p>K-means uses distances between data points. Thus, data points that are closer to each other are likely to fall into the same cluster.<\/p>\n\n\n\n<p>In its basic form, K-means begins by randomly choosing K data points as its initial cluster centroids. 
Here, K is the number of clusters we chose up front. In our example, the starting centroids <i>could be<\/i> <b>age 25 and income of 20k (cluster A)<\/b> and <b>age 50 and income of 100k (cluster B)<\/b>.<\/p>\n\n\n\n<p>It then calculates the distance between every data point in our dataset and these centroids, and assigns each data point to the closest centroid. That means a 20-year-old individual with 30k income would be in cluster A, and a 60-year-old person with 80k income would be in cluster B.<\/p>\n\n\n\n<p>Next, K-means calculates a new set of centroids by averaging all the data points in each cluster. Thus, the new centroid of cluster A could be age 22 and 25k income. The centroid of cluster B may have changed as well. These will be the centroids for the next iteration.<\/p>\n\n\n\n<p>This process continues until there is no significant change in the cluster centroids.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What if you want to consider more features?<\/h2>\n\n\n\n<p>Finding clusters with only age and income seems straightforward. You may not need machine learning to do this. But in reality, you\u2019ll have a bunch of other variables to work with. These may include voters\u2019 education level, health, and financial history.<\/p>\n\n\n\n<p>The two dimensions, however, helped us understand the algorithm. Also, it was easy to visualize with only two variables. But let\u2019s add a new variable to make it more realistic.<\/p>\n\n\n\n<p>Say you also have the debt levels of each individual. Isn\u2019t this a critical piece of information for political campaigns?<\/p>\n\n\n\n<p>You can do this without much alteration to the code you\u2019ve already written. 
Just add \u201cDebt\u201d to the list of variables you\u2019re passing.<\/p>\n\n\n\n<p><script><br \/>\n            \/* This is ugly, but to ensure we only create this function once, and only call<br \/>\n            each JS library once, we need to check here in case there are multiple Wagtail<br \/>\n            blocks on this page. This will ensure we only load the minimum payload. *\/<br \/>\n            if(typeof loadPrismLanguage != 'function') {<br \/>\n                window.loadPrismLanguage = function(language) {<br \/>\n                    var libraries = [<br \/>\n                        {<br \/>\n                            \"id\": \"code-block-prismjs\",<br \/>\n                            \"url\": \"\/\/cdnjs.cloudflare.com\/ajax\/libs\/prism\/1.25.0\/prism.min.js\"<br \/>\n                        },<br \/>\n                        {<br \/>\n                            \"id\": \"code-block-prismjs-\" + language,<br \/>\n                            \"url\": \"\/\/cdnjs.cloudflare.com\/ajax\/libs\/prism\/1.25.0\/components\/prism-\" + language + \".min.js\"<br \/>\n                        },<br \/>\n        {<br \/>\n            \"id\": \"code-block-line-numbers\",<br \/>\n            \"url\": \"\/\/cdnjs.cloudflare.com\/ajax\/libs\/prism\/1.25.0\/plugins\/line-numbers\/prism-line-numbers.min.js\"<br \/>\n        }<\/p>\n<p>                    ];<\/p>\n<p>                    for(const library of libraries) {<br \/>\n                        if(document.getElementById(library[\"id\"]) == null) {<br \/>\n                            var s = document.createElement(\"script\");<br \/>\n                            s.id = library[\"id\"];<br \/>\n                            s.type = \"text\/javascript\";<br \/>\n                            s.src = library[\"url\"];<br \/>\n                            s.async = false;<br \/>\n                            document.body.appendChild(s);<br \/>\n                        }<br \/>\n                    }<br \/>\n        
        };<br \/>\n            }<\/p>\n<p>            loadPrismLanguage('python');<\/p>\n<p>            language_class_name = 'language-python';<br \/>\n            <\/script><\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro padding-bottom-disabled\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:16px;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:20px\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"# Performing K-Means to find 3 clusters using the three variables\nkmeans = KMeans(n_clusters=3, random_state=0).fit(df[[&quot;Age&quot;, &quot;Income&quot;, &quot;Debt&quot;]])\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 
2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #616E88\"># Performing K-Means to find 3 clusters using the three variables<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">kmeans <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">KMeans<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">n_clusters<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">random_state<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">).<\/span><span style=\"color: #88C0D0\">fit<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">df<\/span><span style=\"color: #ECEFF4\">[[<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Age<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Income<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Debt<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">]])<\/span><\/span><\/code><\/pre><span style=\"display:flex;align-items:flex-end;padding:10px;width:100%;justify-content:flex-end;background-color:#2e3440ff;color:#c8d0e0;font-size:12px;line-height:1;position:relative\">Python<\/span><\/div>\n\n\n\n<p><script><br \/>\n                var block_num = (typeof block_num === 'undefined') ? 
0 : block_num;<br \/>\n                block_num++;<br \/>\n                document.getElementById('target-element-current').className = language_class_name;<br \/>\n                document.getElementById('target-element-current').id = 'target-element-' + block_num;<br \/>\n            <\/script><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Finding a correct value for K.<\/h2>\n\n\n\n<p>K-means finds the clusters, but it does not determine the number of clusters; you have to define it yourself.<\/p>\n\n\n\n<p>If you use too small a value for K, you might not address the real issues of voters. Your promises may seem superficial because people with different needs are now in the same cluster.<\/p>\n\n\n\n<p>On the other hand, if you use too large a value for K, campaigns will be costly. Your party may end up catering to every tiny niche, which isn\u2019t worth the cost.<\/p>\n\n\n\n<p>But how do you know whether the number you came up with is the right fit?<\/p>\n\n\n\n<p>The elbow method could help you find this.<\/p>\n\n\n\n<p>In this method, we use the sum of squared distances (also called the sum of squared errors, SSE) between data points and their cluster centroids to determine the K value. We call this value inertia. With more clusters (larger K), inertia generally decreases because each data point ends up closer to its nearest centroid.<\/p>\n\n\n\n<p>Yet, at some point, the reduction becomes insignificant. We take the number of clusters at this point, the \u201celbow\u201d of the curve, as a suitable value for K.<\/p>\n\n\n\n<p>Here is the Python code snippet to do this.<\/p>\n\n\n\n<p><script><br \/>\n            \/* This is ugly, but to ensure we only create this function once, and only call<br \/>\n            each JS library once, we need to check here in case there are multiple Wagtail<br \/>\n            blocks on this page. This will ensure we only load the minimum payload. 
*\/<br \/>\n            if(typeof loadPrismLanguage != 'function') {<br \/>\n                window.loadPrismLanguage = function(language) {<br \/>\n                    var libraries = [<br \/>\n                        {<br \/>\n                            \"id\": \"code-block-prismjs\",<br \/>\n                            \"url\": \"\/\/cdnjs.cloudflare.com\/ajax\/libs\/prism\/1.25.0\/prism.min.js\"<br \/>\n                        },<br \/>\n                        {<br \/>\n                            \"id\": \"code-block-prismjs-\" + language,<br \/>\n                            \"url\": \"\/\/cdnjs.cloudflare.com\/ajax\/libs\/prism\/1.25.0\/components\/prism-\" + language + \".min.js\"<br \/>\n                        },<br \/>\n        {<br \/>\n            \"id\": \"code-block-line-numbers\",<br \/>\n            \"url\": \"\/\/cdnjs.cloudflare.com\/ajax\/libs\/prism\/1.25.0\/plugins\/line-numbers\/prism-line-numbers.min.js\"<br \/>\n        }<\/p>\n<p>                    ];<\/p>\n<p>                    for(const library of libraries) {<br \/>\n                        if(document.getElementById(library[\"id\"]) == null) {<br \/>\n                            var s = document.createElement(\"script\");<br \/>\n                            s.id = library[\"id\"];<br \/>\n                            s.type = \"text\/javascript\";<br \/>\n                            s.src = library[\"url\"];<br \/>\n                            s.async = false;<br \/>\n                            document.body.appendChild(s);<br \/>\n                        }<br \/>\n                    }<br \/>\n                };<br \/>\n            }<\/p>\n<p>            loadPrismLanguage('python');<\/p>\n<p>            language_class_name = 'language-python';<br \/>\n            <\/script><\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro padding-bottom-disabled\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" 
style=\"font-size:16px;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:20px\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"errors = [] # Create an empty list to collect inertias\n\n# Loop through some desirable number of clusters\n# The KMeans algorithm's fit method returns a property called inertia_ that has the information we need\nfor k in range(2, 10):\n    errors.append(\n        {\n            &quot;inertia&quot;: KMeans(n_clusters=k, random_state=0)\n            .fit(df[[&quot;Age&quot;, &quot;Income&quot;, &quot;Debt&quot;]])\n            .inertia_,\n            &quot;num_clusters&quot;: k,\n        }\n    )\n\n# for convenience convert the list to a pandas dataframe\ndf_inertia = pd.DataFrame(errors)\n\n# Create a line plot of inertia against number of clusters\nsns.lineplot(\n    x=df_inertia.num_clusters,\n    y=df_inertia.inertia,\n)\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 
2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #D8DEE9FF\">errors <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">[]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #616E88\"># Create an empty list to collect inertias<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Loop through some desirable number of clusters<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># The KMeans algorithm&#39;s fit method returns a property called inertia_ that has the information we need<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> k <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">range<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">10<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    errors<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #ECEFF4\">{<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">inertia<\/span><span style=\"color: 
#ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">KMeans<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">n_clusters<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\">k<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">random_state<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">fit<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">df<\/span><span style=\"color: #ECEFF4\">[[<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Age<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Income<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Debt<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">]])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">inertia_<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">num_clusters<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> k<\/span><span 
style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #ECEFF4\">}<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># for convenience convert the list to a pandas dataframe<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">df_inertia <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> pd<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">DataFrame<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">errors<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Create a line plot of inertia against number of clusters<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">sns<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">lineplot<\/span><span style=\"color: #ECEFF4\">(<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #D8DEE9\">x<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\">df_inertia<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">num_clusters<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #D8DEE9\">y<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\">df_inertia<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">inertia<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">)<\/span><\/span><\/code><\/pre><span 
style=\"display:flex;align-items:flex-end;padding:10px;width:100%;justify-content:flex-end;background-color:#2e3440ff;color:#c8d0e0;font-size:12px;line-height:1;position:relative\">Python<\/span><\/div>\n\n\n\n<p><script><br \/>\n                var block_num = (typeof block_num === 'undefined') ? 0 : block_num;<br \/>\n                block_num++;<br \/>\n                document.getElementById('target-element-current').className = language_class_name;<br \/>\n                document.getElementById('target-element-current').id = 'target-element-' + block_num;<br \/>\n            <\/script><\/p>\n\n\n\n<p>We fit K-means with different values of K and collect the inertia values in a list. We then use this information to plot the graph below.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"720\" height=\"405\" src=\"https:\/\/www.the-analytics.club\/wp-content\/uploads\/2023\/06\/image-109.png\" alt=\"The elbow graph to find optimal number of clusters in K-Means\" class=\"wp-image-986\" title=\"\" srcset=\"https:\/\/www.the-analytics.club\/wp-content\/uploads\/2023\/06\/image-109.png 720w, https:\/\/www.the-analytics.club\/wp-content\/uploads\/2023\/06\/image-109-300x169.png 300w\" sizes=\"(max-width: 720px) 100vw, 720px\" \/><\/figure><\/div>\n\n\n<p>On this graph, it\u2019s clear that the drop in inertia isn\u2019t significant after three clusters.<\/p>\n\n\n\n<p>This means that if you group voters into three clusters, people who fall into the same cluster have similar characteristics. You may devise three different strategies to address them all.<\/p>\n\n\n\n<p>Conversely, four clusters would make your campaigns expensive because each demands a different strategy. 
Two clusters could make the voters feel their needs aren\u2019t met.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Drawbacks of K-means clustering.<\/h2>\n\n\n\n<p>You may have a question about the initial cluster centroids of K-means. If they\u2019re chosen randomly, wouldn\u2019t the clusters be different each time the algorithm runs?<\/p>\n\n\n\n<p>Yes, they can be. K-means can settle on a local minimum, so different starting centroids may produce different clusters.<\/p>\n\n\n\n<p>To solve this problem, you may have to run the algorithm several times and select the run that fits best. The total sum of squared distances between cluster centroids and their members (the inertia) could help us find the right fit. The lower this number, the better.<\/p>\n\n\n\n<p>The K-means algorithm is also sensitive to outliers. Since the algorithm recalculates each centroid as an average at every iteration, outliers can pull centroids away from the bulk of the data. Consider removing or otherwise handling outliers before you begin.<\/p>\n\n\n\n<p>These drawbacks are manageable. But what about more serious issues that make K-means unsuitable?<\/p>\n\n\n\n<p>K-means performs poorly for clusters with varying sizes or densities. Thus, if the dataset has far more old-age people with high income than mid-age low-income earners, K-means may end up with meaningless clusters.<\/p>\n\n\n\n<p>K-means performs well for compact, linearly separable clusters. But how about other types of data? K-means struggles to cluster other geometric shapes, such as elongated ellipses and concentric circles.<\/p>\n\n\n\n<p>Nevertheless, K-means has <a href=\"https:\/\/www.the-analytics.club\/welcome-to-the-age-of-citizen-data-scientists\/\" title=\"Welcome to the Age of Citizen Data Scientists\">long been a favorite of data scientists<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">In Summary,<\/h2>\n\n\n\n<p>K-means is a simple and popular clustering algorithm. It groups data points into non-overlapping clusters.<\/p>\n\n\n\n<p>In this article, we\u2019ve discussed how to perform K-means clustering with Python. We started with two variables. 
Then we talked about adding more variables and finding a suitable value for K.<\/p>\n\n\n\n<p>Throughout this article, we\u2019ve followed a hypothetical political campaign scenario. Although, in reality, you may have innumerable other variables, the basic steps are the same. Also, you can use the K-means algorithm in countless other applications. For example, you can use it to compress images or to group documents on your PC.<\/p>\n\n\n\n<p>But that doesn\u2019t mean it\u2019s the best all the time. We\u2019ve discussed several drawbacks of the K-means algorithm. For example, K-means doesn\u2019t do well on complicated geometric shapes. Algorithms such as DBSCAN could handle such cases better. But that\u2019s for another story.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dots\"\/>\n\n\n\n<blockquote class=\"wp-block-quote\">\n<p>Thanks for the read, friend. It seems you and I have lots of common interests. Say Hi to me on <a href=\"https:\/\/www.linkedin.com\/in\/thuwarakesh\/\" target=\"_blank\" rel=\"nofollow noopener\"><strong>LinkedIn<\/strong><\/a>, <a href=\"https:\/\/twitter.com\/Thuwarakesh\" target=\"_blank\" rel=\"nofollow noopener\"><strong>Twitter<\/strong><\/a>, and <a href=\"https:\/\/thuwarakesh.medium.com\/subscribe\" target=\"_blank\" rel=\"nofollow noopener\"><strong>Medium<\/strong><\/a>. <\/p>\n\n\n\n<p>Not a Medium member yet? Please use this link to <a href=\"https:\/\/thuwarakesh.medium.com\/membership\" target=\"_blank\" rel=\"nofollow noopener\"><strong>become a member<\/strong><\/a> because I earn a commission for referring at no extra cost for you.<\/p>\n<\/blockquote>\n","protected":false},"excerpt":{"rendered":"<p>Imagine that you are the chief campaign planner for the next presidential election. Thanks to the pandemic-driven technology adoption, campaigns are going online this time. 
Because it helps cover the\u2026<\/p>\n","protected":false},"author":2,"featured_media":218,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_kad_blocks_custom_css":"","_kad_blocks_head_custom_js":"","_kad_blocks_body_custom_js":"","_kad_blocks_footer_custom_js":"","_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"footnotes":""},"categories":[3,4,5],"tags":[],"taxonomy_info":{"category":[{"value":3,"label":"Python"},{"value":4,"label":"Data Science"},{"value":5,"label":"Programming"}]},"featured_image_src_large":["https:\/\/www.the-analytics.club\/wp-content\/uploads\/2023\/06\/running-election-campaigns-with-k-means-clustering-1024x576.jpg",1024,576,true],"author_info":{"display_name":"Thuwarakesh","author_link":"https:\/\/www.the-analytics.club\/author\/thuwarakesh\/"},"comment_info":0,"category_info":[{"term_id":3,"name":"Python","slug":"python","term_group":0,"term_taxonomy_id":3,"taxonomy":"category","description":"","parent":5,"count":52,"filter":"raw","cat_ID":3,"category_count":52,"category_description":"","cat_name":"Python","category_nicename":"python","category_parent":5},{"term_id":4,"name":"Data Science","slug":"data-science","term_group":0,"term_taxonomy_id":4,"taxonomy":"category","description":"","parent":0,"count":22,"filter":"raw","cat_ID":4,"category_count":22,"category_description":"","cat_name":"Data 
Science","category_nicename":"data-science","category_parent":0},{"term_id":5,"name":"Programming","slug":"programming","term_group":0,"term_taxonomy_id":5,"taxonomy":"category","description":"","parent":0,"count":43,"filter":"raw","cat_ID":5,"category_count":43,"category_description":"","cat_name":"Programming","category_nicename":"programming","category_parent":0}],"tag_info":false,"_links":{"self":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts\/382"}],"collection":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/comments?post=382"}],"version-history":[{"count":2,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts\/382\/revisions"}],"predecessor-version":[{"id":1306,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts\/382\/revisions\/1306"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/media\/218"}],"wp:attachment":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/media?parent=382"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/categories?post=382"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/tags?post=382"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}