{"id":371,"date":"2022-12-22T00:00:00","date_gmt":"2022-12-22T00:00:00","guid":{"rendered":"https:\/\/tac.debuzzify.com\/?p=371"},"modified":"2023-06-27T11:53:46","modified_gmt":"2023-06-27T11:53:46","slug":"5-pandas-performance-optimization-tips-without-crazy-setups","status":"publish","type":"post","link":"https:\/\/www.the-analytics.club\/5-pandas-performance-optimization-tips-without-crazy-setups\/","title":{"rendered":"5 Pandas Performance Optimization Tips Without Crazy Setups."},"content":{"rendered":"\n
Pandas is the de-facto data wrangling library in Python. It has excellent capabilities to slice and dice large dataframes. You’d quickly see the difference if you tried the same operation in Excel and in Pandas.<\/p>\n\n\n\n
Most operations are optimized with native C implementations under the hood, which makes vectorized operations nearly effortless. As a result, even complex algorithms tend to run faster in Pandas.<\/p>\n\n\n\n
Yet the defaults are not the best Pandas can offer. A few small tweaks help analysts get the most out of it, and this post covers exactly those tips.<\/p>\n\n\n\n
Here are six ways to improve the performance of Pandas:<\/p>\n\n\n\n
.loc<\/code> indexer instead of chained []<\/code> indexing. The .loc<\/code> indexer is faster because it performs a single direct lookup instead of first materializing an intermediate object.<\/li>\n\n\n\n- Use the
numexpr<\/code> library for operations on large arrays. The numexpr<\/code> library can significantly speed up element-wise operations on large arrays using its fast, multi-threaded expression evaluator.<\/li>\n\n\n\n- Use the
dtype<\/code> parameter when reading in a file. Specifying the data type of each column can significantly reduce the memory usage of the DataFrame.<\/li>\n\n\n\n- Use the
query<\/code> method for fast filtering. On large DataFrames, the query<\/code> method can outperform Boolean indexing because it evaluates the whole filter expression in one pass through the numexpr<\/code> engine.<\/li>\n\n\n\n- Use the
numba<\/code> library to speed up certain operations. The numba<\/code> library is a just-in-time compiler that can significantly speed up numerical code, especially explicit loops over large arrays.<\/li>\n\n\n\n- Use the
swifter<\/code> library for parallelizing certain operations. The swifter<\/code> library is a Pandas extension that can parallelize certain operations, such as apply<\/code>, using multiple cores.<\/li>\n<\/ol>\n\n\n\nRelated: <\/b>How to Speed up Python Data Pipelines up to 91X?<\/i><\/b><\/a><\/p>\n\n\n\nUse .loc over [] indexers<\/h2>\n\n\n\n
The .loc<\/code> indexer is faster than chained []<\/code> indexing because it resolves the row and column labels in a single lookup instead of creating an intermediate object.<\/p>\n\n\n\nFor example, consider the following DataFrame:<\/p>\n\n\n\n
import pandas as pd\n\ndf = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})<\/code><\/pre>\n\n\n\nTo access a single element of the DataFrame using the []<\/code> indexer, you would use the following syntax:<\/p>\n\n\n\ndf['A'][0]<\/code><\/pre>\n\n\n\nThis syntax first creates a new Series object for column 'A' and then accesses the first element of that Series.<\/p>\n\n\n\n
On the other hand, to access the same element using the .loc<\/code> indexer, you would use the following syntax:<\/p>\n\n\n\ndf.loc[0, 'A']<\/code><\/pre>\n\n\n\nThis syntax accesses the element directly without returning a new object.<\/p>\n\n\n\n
In general, the .loc<\/code> indexer is faster because it avoids the overhead of returning a new object. This is especially beneficial when accessing large datasets, as the overhead of returning a new object can become significant.<\/p>\n\n\n\nHere’s an example of how you can use the %timeit<\/code> magic command in Jupyter to compare the performance of the []<\/code> and .loc<\/code> indexers:<\/p>\n\n\n\n%timeit df['A'][0]\n%timeit df.loc[0, 'A']<\/code><\/pre>\n\n\n\nYou should see that the .loc indexer is significantly faster on a large dataset.<\/p>\n\n\n\n
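For a single scalar lookup, pandas also provides the .at<\/code> accessor, which is specialized for exactly that case and skips the more general slice-and-list machinery that .loc<\/code> carries. A quick sketch:<\/p>\n\n\n\n

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# .at performs a single scalar lookup by label, skipping the
# slice/list handling that .loc supports.
value = df.at[0, 'A']
print(value)  # 1
```

Timing df.at[0, 'A']<\/code> alongside the two indexers above with %timeit<\/code> typically shows .at<\/code> as the fastest of the three for scalar access.<\/p>\n\n\n\n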
The numexpr library can improve Pandas performance.<\/h2>\n\n\n\n
The numexpr<\/code> library is a fast numerical expression evaluator for NumPy arrays. It can significantly speed up element-wise operations on large arrays by compiling whole expressions for its multi-threaded virtual machine.<\/p>\n\n\n\nPandas uses numexpr<\/code> automatically for large arithmetic and comparison operations when the pd.options.compute.use_numexpr<\/code> option is True<\/code> (the default whenever the library is installed).<\/p>\n\n\n\nHere’s an example of how to use the numexpr<\/code> library to improve the performance of a Pandas operation using synthetic data:<\/p>\n\n\n\nimport pandas as pd\nimport numpy as np\n\n# Make sure the numexpr backend is enabled\npd.options.compute.use_numexpr = True\n\n# Create a large DataFrame with synthetic data\ndf = pd.DataFrame(np.random.randn(100_000, 100))\n\n# Element-wise arithmetic like this is routed through numexpr\n%timeit df * 2 + df<\/code><\/pre>\n\n\n\nWithout the numexpr<\/code> library, this operation would be relatively slow. However, with the numexpr<\/code> library, the operation should be significantly faster.<\/p>\n\n\n\nYou can use the %timeit<\/code> magic command to compare the performance of the operation with and without the numexpr<\/code> library. Set the pd.options.compute.use_numexpr<\/code> option to False<\/code> and run the %timeit<\/code> command again to see the difference in performance.<\/p>\n\n\n\n
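Pandas also exposes the numexpr<\/code> engine directly through pd.eval<\/code> and DataFrame.eval<\/code>, which evaluate a whole arithmetic expression in one pass instead of allocating one temporary array per operator. A minimal sketch with synthetic data:<\/p>\n\n\n\n

```python
import pandas as pd
import numpy as np

# Synthetic data: 100,000 rows, four numeric columns
df = pd.DataFrame(np.random.randn(100_000, 4), columns=list('abcd'))

# DataFrame.eval hands the whole expression to numexpr (when installed),
# avoiding one temporary array per operator.
result = df.eval('a + b * c - d')

# Plain-pandas equivalent, for comparison with %timeit
expected = df['a'] + df['b'] * df['c'] - df['d']
print(np.allclose(result, expected))  # True
```

The gain grows with the number of operators in the expression, since each operator in the plain-pandas version allocates its own intermediate array.<\/p>\n\n\n\n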
Related: <\/b>Is Your Python For-loop Slow? Use NumPy Instead<\/i><\/b><\/a><\/p>\n\n\n\nSpecify dtype explicitly to reduce memory consumption.<\/h2>\n\n\n\n
When reading in a file using Pandas, you can specify the data type of each column using the dtype<\/code> parameter. Specifying the data type of each column can significantly reduce the memory usage of the resulting DataFrame.<\/p>\n\n\n\nBy default, Pandas will infer the data type of each column based on the data it contains. However, this can result in unnecessarily large memory usage, especially if the data contains a mix of data types. For example, suppose a column contains both string and numeric data. In that case, Pandas will infer the data type to be object<\/code>, which can use significantly more memory than a numeric data type.<\/p>\n\n\n\nTo reduce the memory usage of the DataFrame, you can specify the data type of each column explicitly using the dtype<\/code> parameter. For example:<\/p>\n\n\n\nimport pandas as pd\n\n# Read in a file with the dtype parameter specified\ndf = pd.read_csv('my_file.csv', dtype={'column_1': 'float64', 'column_2': 'object'})<\/code><\/pre>\n\n\n\nIn this example, the column_1<\/code> column is specified as a float64<\/code> data type, and the column_2<\/code> column is specified as an object<\/code> data type. This can significantly reduce the memory usage of the DataFrame compared to inferring the data types automatically.<\/p>\n\n\n\nIt’s important to note that specifying the data type correctly is crucial for reducing memory usage. If you specify the wrong data type, you may use more memory than if you had let Pandas infer the data type automatically.<\/p>\n\n\n\n
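The savings are easy to measure with memory_usage(deep=True)<\/code>. For low-cardinality string columns, converting to the category<\/code> dtype is often the biggest win; a small sketch (the color values are illustrative):<\/p>\n\n\n\n

```python
import pandas as pd

# A low-cardinality string column stored as plain object dtype...
s_obj = pd.Series(['red', 'green', 'blue'] * 100_000)

# ...versus the same data converted to the category dtype,
# which stores small integer codes plus only three labels.
s_cat = s_obj.astype('category')

obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(obj_bytes > cat_bytes)  # True
```

The same trick works at read time: pass dtype={'column_2': 'category'}<\/code> to read_csv<\/code> so the large object column is never materialized at all.<\/p>\n\n\n\n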
Use the query method to filter dataframes<\/h2>\n\n\n\n
The query method in Pandas can be a faster way to filter a DataFrame<\/a> than Boolean indexing. This is because the query method<\/a> hands the whole filter expression to the numexpr<\/code> engine, which can evaluate it in a single multi-threaded pass instead of materializing intermediate Boolean arrays.<\/p>\n\n\n\nHere’s an example of how to use the query<\/code> method to filter a DataFrame:<\/p>\n\n\n\nimport pandas as pd\n\n# Create a sample DataFrame\ndf = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})\n\n# Use the query method to filter the DataFrame\nfiltered_df = df.query('A > 1')\n\nprint(filtered_df)<\/code><\/pre>\n\n\n\nThis will output the following DataFrame:<\/p>\n\n\n\n
A B\n1 2 5\n2 3 6<\/code><\/pre>\n\n\n\nTo compare the performance of the query<\/code> method with Boolean indexing, you can use the %timeit<\/code> magic command in Jupyter. Here’s an example of how to do this:<\/p>\n\n\n\n%timeit df[df['A'] > 1]\n%timeit df.query('A > 1')<\/code><\/pre>\n\n\n\nYou should see that the query method is significantly faster than Boolean indexing on a large dataset.<\/p>\n\n\n\n
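Two query<\/code> features worth knowing: local Python variables can be referenced with the @<\/code> prefix, and the engine<\/code> parameter lets you switch between the numexpr-backed and plain-Python evaluators for comparison. A short sketch (the threshold<\/code> variable is illustrative):<\/p>\n\n\n\n

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Local Python variables are referenced with the @ prefix
threshold = 1
filtered = df.query('A > @threshold')

# engine='python' bypasses numexpr -- handy for benchmarking the
# two evaluators against each other, or when numexpr is absent.
same = df.query('A > @threshold', engine='python')
print(filtered.equals(same))  # True
```

Benchmarking the two engines with %timeit<\/code> on your own data is the quickest way to see whether the numexpr path pays off for a given filter.<\/p>\n\n\n\n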
The query<\/code> method is not always the fastest way to filter a DataFrame. On small DataFrames, or for a simple condition on a single numeric or Boolean column, plain Boolean indexing is often faster because query<\/code> must parse the expression first. However, the query method is generally a good choice for fast filtering of large DataFrames.<\/p>\n\n\n\nRelated: <\/b>How to Serve Massive Computations Using Python Web Apps.<\/i><\/b><\/a><\/p>\n\n\n\nUse Numba to run Pandas operations faster.<\/h2>\n\n\n\n
The numba library<\/a> is a just-in-time compiler that can significantly speed up the execution of certain operations, especially loops over large arrays. numba<\/code> compiles plain Python functions that operate on NumPy arrays, so the usual pattern is to pass a DataFrame’s underlying arrays into a function decorated with numba.jit<\/code>.<\/p>\n\n\n\nHere’s an example of how to use numba<\/code> to speed up a Pandas operation:<\/p>\n\n\n\nimport pandas as pd\nimport numba\n\n# Create a sample DataFrame\ndf = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})\n\n# Define a compiled function that works on NumPy arrays\n@numba.jit(nopython=True)\ndef add_arrays(a, b):\n    return a + b\n\n# Pass the columns' underlying arrays to the compiled function\nresult = add_arrays(df['A'].to_numpy(), df['B'].to_numpy())\n\nprint(result)<\/code><\/pre>\n\n\n\nThis will output the following array:<\/p>\n\n\n\n[5 7 9]<\/code><\/pre>\n\n\n\nLet’s compare the performance of the compiled function with the plain Pandas operation using %timeit<\/code>. Here’s an example of how to do this:<\/p>\n\n\n\n%timeit df['A'] + df['B']\n%timeit add_arrays(df['A'].to_numpy(), df['B'].to_numpy())<\/code><\/pre>\n\n\n\nFor a simple element-wise addition, Pandas is already vectorized, so the gain here may be small.<\/p>\n\n\n\n
Note that numba<\/code> may not always speed up Pandas operations; in some cases, it may even slow them down. Therefore, always measure the performance of your decorated functions to confirm that they are actually faster.<\/p>\n\n\n\n
Use Swifter to Parallelize tasks.<\/h2>\n\n\n\n
The swifter library<\/a> is a Pandas extension that can parallelize certain operations, such as apply<\/code>, using multiple cores. To use swifter<\/code>, you can call the .swifter.apply<\/code> method on a Pandas DataFrame or Series instead of the .apply<\/code><\/a> method.<\/p>\n\n\n\nHere’s an example of how to use swifter<\/code> to parallelize the apply<\/code> operation:<\/p>\n\n\n\nimport pandas as pd\nimport swifter\n\n# Create a sample DataFrame\ndf = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})\n\n# Use the swifter.apply method to parallelize the row-wise apply\n# operation (axis=1 is required so each call receives a row)\nresult = df.swifter.apply(lambda x: x['A'] + x['B'], axis=1)\n\nprint(result)<\/code><\/pre>\n\n\n\nThis will output the following Series:<\/p>\n\n\n\n0 5\n1 7\n2 9\ndtype: int64<\/code><\/pre>\n\n\n\nTo compare the performance of the swifter.apply<\/code> method with the regular apply<\/code> method, you can use the %timeit<\/code> magic command in Jupyter. Here’s an example of how to do this:<\/p>\n\n\n\n%timeit df.apply(lambda x: x['A'] + x['B'], axis=1)\n%timeit df.swifter.apply(lambda x: x['A'] + x['B'], axis=1)<\/code><\/pre>\n\n\n\nOn a large dataset, you should see that the swifter.apply<\/code> method is significantly faster than the regular apply<\/code> method.<\/p>\n\n\n\nLike Numba, swifter<\/code> also may not always speed up Pandas operations. Again, you should test the performance of your swifter<\/code> operations to ensure that they are actually faster.<\/p>\n\n\n\nFinal thoughts<\/h2>\n\n\n\n
Big datasets are unavoidable today. Everything generates data, from your watch to your local weather station.<\/p>\n\n\n\n
When analyzing large datasets in Pandas, you’ll sooner or later run into performance issues. The Pandas library is already well optimized, with excellent built-in vectorized operations implemented in C.<\/p>\n\n\n\n
But this post has a few tips to improve its performance further. With these techniques, you can achieve better results faster.<\/p>\n","protected":false},"excerpt":{"rendered":"
How to speed up Pandas data wrangling operations with simple tweaks.<\/p>\n","protected":false},"author":2,"featured_media":641,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_kad_blocks_custom_css":"","_kad_blocks_head_custom_js":"","_kad_blocks_body_custom_js":"","_kad_blocks_footer_custom_js":"","_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"footnotes":""},"categories":[7],"tags":[],"taxonomy_info":{"category":[{"value":7,"label":"Data Wrangling"}]},"featured_image_src_large":["https:\/\/www.the-analytics.club\/wp-content\/uploads\/2023\/06\/12838-1024x638.jpg",1024,638,true],"author_info":{"display_name":"Thuwarakesh","author_link":"https:\/\/www.the-analytics.club\/author\/thuwarakesh\/"},"comment_info":0,"category_info":[{"term_id":7,"name":"Data Wrangling","slug":"data-wrangling","term_group":0,"term_taxonomy_id":7,"taxonomy":"category","description":"","parent":4,"count":4,"filter":"raw","cat_ID":7,"category_count":4,"category_description":"","cat_name":"Data 
Wrangling","category_nicename":"data-wrangling","category_parent":4}],"tag_info":false,"_links":{"self":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts\/371"}],"collection":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/comments?post=371"}],"version-history":[{"count":2,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts\/371\/revisions"}],"predecessor-version":[{"id":1318,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts\/371\/revisions\/1318"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/media\/641"}],"wp:attachment":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/media?parent=371"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/categories?post=371"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/tags?post=371"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}