{"id":371,"date":"2022-12-22T00:00:00","date_gmt":"2022-12-22T00:00:00","guid":{"rendered":"https:\/\/tac.debuzzify.com\/?p=371"},"modified":"2023-06-27T11:53:46","modified_gmt":"2023-06-27T11:53:46","slug":"5-pandas-performance-optimization-tips-without-crazy-setups","status":"publish","type":"post","link":"https:\/\/www.the-analytics.club\/5-pandas-performance-optimization-tips-without-crazy-setups\/","title":{"rendered":"5 Pandas Performance Optimization Tips Without Crazy Setups."},"content":{"rendered":"\n

5 Pandas Performance Optimization Tips Without Crazy Setups.<\/p>\n\n\n\n

Pandas is the de facto data wrangling library in Python. It has excellent capabilities to slice and dice large dataframes. Try the same operation in Excel and in Pandas, and you’d quickly see the difference.<\/p>\n\n\n\n

Most of its operations are pre-optimized with native C implementations under the hood, which makes vectorized operations effortless. Thus, even complex algorithms often run faster in Pandas.<\/p>\n\n\n\n

Yet, the defaults alone aren’t the best of what Pandas can offer. Analysts should be aware of a few little tweaks to get the most out of it. This post is all about these little tips.<\/p>\n\n\n\n

6 Ways to Improve Pandas Performance<\/h2>\n\n\n\n

Here are six ways to improve the performance of Pandas:<\/p>\n\n\n\n

    \n
1. Use the .loc<\/code> indexer instead of chained []<\/code> indexing. The .loc<\/code> indexer is faster because it performs the lookup in a single step instead of creating an intermediate object first.<\/li>\n\n\n\n
2. Use the numexpr<\/code> library for operations on large arrays. The numexpr<\/code> library can significantly speed up element-wise expressions on large arrays using its multi-threaded, C-based evaluation engine.<\/li>\n\n\n\n
  3. Use the dtype<\/code> parameter when reading in a file. Specifying the data type of each column can significantly reduce the memory usage of the DataFrame.<\/li>\n\n\n\n
4. Use the query<\/code> method for fast filtering. On large DataFrames, the query<\/code> method can beat Boolean indexing because it evaluates the whole filter expression in one pass with the numexpr<\/code> engine.<\/li>\n\n\n\n
  5. Use the numba<\/code> library to speed up certain operations. The numba<\/code> library is a just-in-time compiler that can significantly speed up the execution of certain operations, especially on large arrays.<\/li>\n\n\n\n
  6. Use the swifter<\/code> library for parallelizing certain operations. The swifter<\/code> library is a Pandas extension that can parallelize certain operations, such as apply<\/code>, using multiple cores.<\/li>\n<\/ol>\n\n\n\n

    Related: <\/b>How to Speed up Python Data Pipelines up to 91X?<\/i><\/b><\/a><\/p>\n\n\n\n

    Use .loc over [] indexers<\/h2>\n\n\n\n

The .loc<\/code> indexer is faster than chained []<\/code> indexing because it performs the lookup in a single step instead of creating an intermediate object first.<\/p>\n\n\n\n

    For example, consider the following DataFrame:<\/p>\n\n\n\n

    import pandas as pd\n\ndf = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})<\/code><\/pre>\n\n\n\n

To access a single element of the DataFrame using the []<\/code> indexer, you would use the following syntax:<\/p>\n\n\n\n

    df['A'][0]<\/code><\/pre>\n\n\n\n

This syntax first returns the column A<\/code> as a Series object and then accesses the first element of that Series.<\/p>\n\n\n\n

    On the other hand, to access the same element using the .loc<\/code> indexer, you would use the following syntax:<\/p>\n\n\n\n

    df.loc[0, 'A']<\/code><\/pre>\n\n\n\n

This syntax looks up the element directly without creating an intermediate object.<\/p>\n\n\n\n

In general, the .loc<\/code> indexer is faster because it avoids the overhead of creating an intermediate object. This is especially beneficial when accessing large datasets, where that overhead can become significant.<\/p>\n\n\n\n

    Here’s an example of how you can use the %timeit<\/code> magic command in Jupyter to compare the performance of the []<\/code> and .loc<\/code> indexers:<\/p>\n\n\n\n

    %timeit df['A'][0]\n%timeit df.loc[0, 'A']<\/code><\/pre>\n\n\n\n

    You should see that the .loc indexer is significantly faster on a large dataset.<\/p>\n\n\n\n
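Outside Jupyter, you can reproduce the comparison with the standard timeit<\/code> module. The sketch below uses an arbitrary million-row frame; it also times .at<\/code>, which Pandas provides specifically for scalar lookups and which is usually faster still:

```python
import timeit

import numpy as np
import pandas as pd

# An arbitrary large frame so the timing difference is visible
df = pd.DataFrame({'A': np.arange(1_000_000), 'B': np.arange(1_000_000)})

# Time each access style; all three return the same value
chained = timeit.timeit(lambda: df['A'][0], number=10_000)
loc = timeit.timeit(lambda: df.loc[0, 'A'], number=10_000)
at = timeit.timeit(lambda: df.at[0, 'A'], number=10_000)

print(f"df['A'][0]     : {chained:.3f}s")
print(f"df.loc[0, 'A'] : {loc:.3f}s")
print(f"df.at[0, 'A']  : {at:.3f}s")
```

All three return the same value; only the lookup path differs.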


    The numexpr library can improve Pandas performance.<\/h2>\n\n\n\n

The numexpr<\/code> library is a fast numerical expression evaluator for NumPy arrays. It can significantly speed up element-wise expressions on large arrays by evaluating them in a single multi-threaded pass instead of creating a temporary array for each intermediate step.<\/p>\n\n\n\n

To use the numexpr<\/code> library with Pandas, set the pd.options.compute.use_numexpr<\/code> option to True<\/code> (this is the default when numexpr<\/code> is installed). Pandas will then use the numexpr<\/code> library to evaluate eval<\/code>, query<\/code>, and large element-wise expressions.<\/p>\n\n\n\n

    Here’s an example of how to use the numexpr<\/code> library to improve the performance of a Pandas operation using synthetic data:<\/p>\n\n\n\n

import pandas as pd\nimport numpy as np\n\n# Make sure Pandas uses numexpr to evaluate expressions\npd.options.compute.use_numexpr = True\n\n# Create a large DataFrame with synthetic data\ndf = pd.DataFrame(np.random.randn(1_000_000, 100))\n\n# Element-wise expressions like this are eligible for numexpr\n%timeit df * 2 + df<\/code><\/pre>\n\n\n\n

Without the numexpr<\/code> library, Pandas evaluates each step of the expression separately, materializing a temporary array for every intermediate result. With numexpr<\/code>, the whole expression is evaluated in a single pass, which should be significantly faster on large frames.<\/p>\n\n\n\n

    You can use the %timeit<\/code> magic command to compare the performance of the operation with and without the numexpr<\/code> library. Set the pd.options.compute.use_numexpr<\/code> option to False<\/code> and run the %timeit<\/code> command again to see the difference in performance.<\/p>\n\n\n\n
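The same comparison can be scripted with pd.eval<\/code>, which lets you pick the engine explicitly. This is a sketch with arbitrary sizes; it falls back to the python engine if numexpr<\/code> isn’t installed:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(200_000, 20))

# engine='python' evaluates step by step; 'numexpr' fuses the expression.
# Skip the numexpr engine gracefully if the library is not installed.
try:
    import numexpr  # noqa: F401
    engines = ['python', 'numexpr']
except ImportError:
    engines = ['python']

for engine in engines:
    t = timeit.timeit(lambda: pd.eval('df * 2 + df', engine=engine), number=20)
    print(f'{engine:>8}: {t:.3f}s')
```

Both engines compute the same result; only the evaluation strategy differs.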


    Related: <\/b>Is Your Python For-loop Slow? Use NumPy Instead<\/i><\/b><\/a><\/p>\n\n\n\n

    Specify dtype explicitly to reduce memory consumption.<\/h2>\n\n\n\n

    When reading in a file using Pandas, you can specify the data type of each column using the dtype<\/code> parameter. Specifying the data type of each column can significantly reduce the memory usage of the resulting DataFrame.<\/p>\n\n\n\n

    By default, Pandas will infer the data type of each column based on the data it contains. However, this can result in unnecessarily large memory usage, especially if the data contains a mix of data types. For example, suppose a column contains both string and numeric data. In that case, Pandas will infer the data type to be object<\/code>, which can use significantly more memory than a numeric data type.<\/p>\n\n\n\n

    To reduce the memory usage of the DataFrame, you can specify the data type of each column explicitly using the dtype<\/code> parameter. For example:<\/p>\n\n\n\n

import pandas as pd\n\n# Read in a file with the dtype parameter specified\ndf = pd.read_csv('my_file.csv', dtype={'column_1': 'float32', 'column_2': 'category'})<\/code><\/pre>\n\n\n\n

In this example, the column_1<\/code> column is read as float32<\/code> instead of the default float64<\/code>, and the column_2<\/code> column is read as category<\/code> instead of object<\/code>. Both choices can significantly reduce the memory usage of the DataFrame compared to inferring the data types automatically.<\/p>\n\n\n\n

    It’s important to note that specifying the data type correctly is crucial for reducing memory usage. If you specify the wrong data type, you may use more memory than if you had let Pandas infer the data type automatically.<\/p>\n\n\n\n
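To see the effect, compare memory_usage(deep=True)<\/code> for the same data read with inferred versus explicit dtypes. The sketch below uses a made-up in-memory CSV; for low-cardinality string columns, category<\/code> is often the biggest win:

```python
import io

import pandas as pd

# A small in-memory CSV standing in for a real file (the data is made up)
csv = io.StringIO('city,temp\n' + 'london,12.5\nparis,14.1\n' * 50_000)

# Read once with inferred dtypes, then again with explicit dtypes
df_default = pd.read_csv(csv)
csv.seek(0)
df_typed = pd.read_csv(csv, dtype={'city': 'category', 'temp': 'float32'})

default_bytes = df_default.memory_usage(deep=True).sum()
typed_bytes = df_typed.memory_usage(deep=True).sum()
print(f'inferred dtypes: {default_bytes:,} bytes')
print(f'explicit dtypes: {typed_bytes:,} bytes')
```

The explicit-dtype frame holds the same 100,000 rows in a fraction of the memory.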

Use the query method to filter DataFrames<\/h2>\n\n\n\n

The query method in Pandas can be a faster way to filter a DataFrame<\/a> than Boolean indexing, especially on large frames. This is because the query method<\/a> evaluates the whole filter expression in one pass with the numexpr<\/code> engine, whereas Boolean indexing builds intermediate Boolean arrays step by step.<\/p>\n\n\n\n

    Here’s an example of how to use the query<\/code> method to filter a DataFrame:<\/p>\n\n\n\n

    import pandas as pd\n\n# Create a sample DataFrame\ndf = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})\n\n# Use the query method to filter the DataFrame\nfiltered_df = df.query('A > 1')\n\nprint(filtered_df)<\/code><\/pre>\n\n\n\n

    This will output the following DataFrame:<\/p>\n\n\n\n

    A  B\n1  2  5\n2  3  6<\/code><\/pre>\n\n\n\n

    To compare the performance of the query<\/code> method with Boolean indexing, you can use the %timeit<\/code> magic command in Jupyter. Here’s an example of how to do this:<\/p>\n\n\n\n

    %timeit df[df['A'] > 1]\n%timeit df.query('A > 1')<\/code><\/pre>\n\n\n\n

On a large dataset, you should see that the query method is faster than Boolean indexing; on a tiny frame like this one, the overhead of parsing the expression usually makes query<\/code> slower.<\/p>\n\n\n\n

The query<\/code> method is not always the fastest way to filter a DataFrame. For small frames, or a simple condition on a single integer or Boolean column, Boolean indexing may be faster. In general, though, the query method is a good choice for fast filtering of large DataFrames.<\/p>\n\n\n\n
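One more convenience worth knowing: query<\/code> can reference Python variables with the @<\/code> prefix, which keeps longer filter expressions readable. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Reference local variables inside the query string with '@'
threshold = 1
filtered = df.query('A > @threshold and B < 6')
print(filtered)

# The equivalent Boolean-indexing version gives the same rows
equivalent = df[(df['A'] > threshold) & (df['B'] < 6)]
assert filtered.equals(equivalent)
```

Both forms select the single row where A is 2 and B is 5.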

    Related: <\/b>How to Serve Massive Computations Using Python Web Apps.<\/i><\/b><\/a><\/p>\n\n\n\n

    Use Numba to run Pandas operations faster.<\/h2>\n\n\n\n

The numba library<\/a> is a just-in-time compiler that can significantly speed up the execution of certain operations, especially on large arrays. numba<\/code> cannot compile Pandas objects directly, so the practical pattern is to extract the underlying NumPy arrays with .to_numpy()<\/code> and decorate a function that operates on those arrays.<\/p>\n\n\n\n

    Here’s an example of how to use numba<\/code> to speed up a Pandas operation:<\/p>\n\n\n\n

import pandas as pd\nimport numba\n\n# Create a sample DataFrame\ndf = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})\n\n# numba cannot compile Pandas objects, so the jitted\n# function works on the underlying NumPy arrays\n@numba.njit\ndef add_arrays(a, b):\n    return a + b\n\n# Pass the NumPy arrays to the jitted function\nresult = add_arrays(df['A'].to_numpy(), df['B'].to_numpy())\n\nprint(result)<\/code><\/pre>\n\n\n\n

This will output the following NumPy array:<\/p>\n\n\n\n

[5 7 9]<\/code><\/pre>\n\n\n\n

Let’s compare the performance of the jitted function with the plain Pandas expression using %timeit<\/code>. Here’s an example of how to do this:<\/p>\n\n\n\n

%timeit df['A'] + df['B']\n%timeit add_arrays(df['A'].to_numpy(), df['B'].to_numpy())<\/code><\/pre>\n\n\n\n

numba<\/code> shines on loops that can’t be expressed as vectorized NumPy operations; for a simple element-wise sum like this one, it may be no faster than the plain Pandas expression, and on tiny frames the call overhead dominates.<\/p>\n\n\n\n

    It’s important to note that numba<\/code> may not always speed up Pandas operations; in some cases, it may even slow them down. Therefore, it’s important to test the performance of your decorated functions to ensure that they are actually faster.<\/p>\n\n\n\n


    Use Swifter to Parallelize tasks.<\/h2>\n\n\n\n

    The swifter library<\/a> is a Pandas extension that can parallelize certain operations, such as apply<\/code>, using multiple cores. To use swifter<\/code>, you can call the .swifter.apply<\/code> method on a Pandas DataFrame or Series instead of the .apply<\/code><\/a> method.<\/p>\n\n\n\n

    Here’s an example of how to use swifter<\/code> to parallelize the apply<\/code> operation:<\/p>\n\n\n\n

import pandas as pd\nimport swifter\n\n# Create a sample DataFrame\ndf = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})\n\n# Use swifter.apply to parallelize a row-wise apply (note axis=1)\nresult = df.swifter.apply(lambda x: x['A'] + x['B'], axis=1)\n\nprint(result)<\/code><\/pre>\n\n\n\n

    This will output the following Series:<\/p>\n\n\n\n

    0    5\n1    7\n2    9\ndtype: int64<\/code><\/pre>\n\n\n\n

To compare the performance of the swifter.apply<\/code> method with the regular apply<\/code> method, you can use the %timeit<\/code> magic command in Jupyter. Here’s an example of how to do this:<\/p>\n\n\n\n

%timeit df.apply(lambda x: x['A'] + x['B'], axis=1)\n%timeit df.swifter.apply(lambda x: x['A'] + x['B'], axis=1)<\/code><\/pre>\n\n\n\n

On a large dataset, the swifter.apply<\/code> method can be significantly faster than the regular apply<\/code> method; on small frames, the overhead of parallelization may make it slower.<\/p>\n\n\n\n

Like numba<\/code>, swifter<\/code> may not always speed up Pandas operations. Again, you should test the performance of your swifter<\/code> operations to ensure that they are actually faster.<\/p>\n\n\n\n
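Whether or not swifter<\/code> helps, the fastest option for a computation like this one is to skip apply<\/code> entirely and write it as a vectorized expression; a quick sanity check:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Row-wise apply: the lambda is called once per row in Python
applied = df.apply(lambda x: x['A'] + x['B'], axis=1)

# Vectorized equivalent: one C-level operation over whole columns
vectorized = df['A'] + df['B']

assert applied.equals(vectorized)
print(vectorized.tolist())  # [5, 7, 9]
```

On large frames the vectorized form typically beats both apply<\/code> and swifter.apply<\/code> by a wide margin.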

    Final thoughts<\/h2>\n\n\n\n

Big datasets are unavoidable today. Everything generates data, from your watch to your local weather station.<\/p>\n\n\n\n

When analyzing large datasets in Pandas, you may well bang your head against the wall over performance issues. The Pandas library is already optimized for performance: it has excellent built-in vectorized operations backed by C implementations.<\/p>\n\n\n\n

    But this post has a few tips to improve its performance further. With these techniques, you can achieve better results faster.<\/p>\n","protected":false},"excerpt":{"rendered":"

    How to speed up Pandas data wrangling operations with simple tweaks.<\/p>\n","protected":false},"author":2,"featured_media":641,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_kad_blocks_custom_css":"","_kad_blocks_head_custom_js":"","_kad_blocks_body_custom_js":"","_kad_blocks_footer_custom_js":"","_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"footnotes":""},"categories":[7],"tags":[],"taxonomy_info":{"category":[{"value":7,"label":"Data Wrangling"}]},"featured_image_src_large":["https:\/\/www.the-analytics.club\/wp-content\/uploads\/2023\/06\/12838-1024x638.jpg",1024,638,true],"author_info":{"display_name":"Thuwarakesh","author_link":"https:\/\/www.the-analytics.club\/author\/thuwarakesh\/"},"comment_info":0,"category_info":[{"term_id":7,"name":"Data Wrangling","slug":"data-wrangling","term_group":0,"term_taxonomy_id":7,"taxonomy":"category","description":"","parent":4,"count":4,"filter":"raw","cat_ID":7,"category_count":4,"category_description":"","cat_name":"Data 
Wrangling","category_nicename":"data-wrangling","category_parent":4}],"tag_info":false,"_links":{"self":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts\/371"}],"collection":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/comments?post=371"}],"version-history":[{"count":2,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts\/371\/revisions"}],"predecessor-version":[{"id":1318,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts\/371\/revisions\/1318"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/media\/641"}],"wp:attachment":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/media?parent=371"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/categories?post=371"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/tags?post=371"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}