{"id":371,"date":"2022-12-22T00:00:00","date_gmt":"2022-12-22T00:00:00","guid":{"rendered":"https:\/\/tac.debuzzify.com\/?p=371"},"modified":"2023-06-27T11:53:46","modified_gmt":"2023-06-27T11:53:46","slug":"5-pandas-performance-optimization-tips-without-crazy-setups","status":"publish","type":"post","link":"https:\/\/www.the-analytics.club\/5-pandas-performance-optimization-tips-without-crazy-setups\/","title":{"rendered":"5 Pandas Performance Optimization Tips Without Crazy Setups."},"content":{"rendered":"\n
Pandas is the de-facto data wrangling library in Python. It has excellent capabilities to slice and dice large dataframes. You’d quickly see the difference if you tried the same operation in Excel and in Pandas.<\/p>\n\n\n\n
Most operations are optimized with native C implementations under the hood, which makes vectorized operations nearly effortless. As a result, even complex algorithms tend to run faster in Pandas.<\/p>\n\n\n\n
Yet the defaults are not the best Pandas can offer. A few small tweaks help analysts get the most out of it, and this post covers exactly those tips.<\/p>\n\n\n\n
Here are six ways to improve the performance of Pandas:<\/p>\n\n\n\n
.loc<\/code> indexer instead of chained []<\/code> indexing. The .loc<\/code> indexer is faster because it performs a single direct lookup instead of first materializing an intermediate object.<\/li>\n\n\n\n- Use the
numexpr<\/code> library for operations on large arrays. The numexpr<\/code> library can significantly speed up element-wise operations on large arrays using its fast, multi-threaded expression evaluator.<\/li>\n\n\n\n- Use the
dtype<\/code> parameter when reading in a file. Specifying the data type of each column can significantly reduce the memory usage of the DataFrame.<\/li>\n\n\n\n- Use the
query<\/code> method for fast filtering. On large DataFrames, the query<\/code> method can outperform Boolean indexing because it evaluates the whole filter expression in one pass through the numexpr<\/code> engine.<\/li>\n\n\n\n- Use the
numba<\/code> library to speed up certain operations. The numba<\/code> library is a just-in-time compiler that can significantly speed up numerical code, especially explicit loops over large arrays.<\/li>\n\n\n\n- Use the
swifter<\/code> library for parallelizing certain operations. The swifter<\/code> library is a Pandas extension that can parallelize certain operations, such as apply<\/code>, using multiple cores.<\/li>\n<\/ol>\n\n\n\nRelated: <\/b>How to Speed up Python Data Pipelines up to 91X?<\/i><\/b><\/a><\/p>\n\n\n\nUse .loc over [] indexers<\/h2>\n\n\n\n
The .loc<\/code> indexer is faster than chained []<\/code> indexing because it resolves the row and column labels in a single lookup instead of creating an intermediate object.<\/p>\n\n\n\nFor example, consider the following DataFrame:<\/p>\n\n\n\n
import pandas as pd\n\ndf = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})<\/code><\/pre>\n\n\n\nTo access a single element of the DataFrame using the []<\/code> indexer, you would use the following syntax:<\/p>\n\n\n\ndf['A'][0]<\/code><\/pre>\n\n\n\nThis syntax first creates a new Series object for column 'A' and then accesses the first element of that Series.<\/p>\n\n\n\n
On the other hand, to access the same element using the .loc<\/code> indexer, you would use the following syntax:<\/p>\n\n\n\ndf.loc[0, 'A']<\/code><\/pre>\n\n\n\nThis syntax accesses the element directly without returning a new object.<\/p>\n\n\n\n
In general, the .loc<\/code> indexer is faster because it avoids the overhead of returning a new object. This is especially beneficial when accessing large datasets, as the overhead of returning a new object can become significant.<\/p>\n\n\n\nHere’s an example of how you can use the %timeit<\/code> magic command in Jupyter to compare the performance of the []<\/code> and .loc<\/code> indexers:<\/p>\n\n\n\n%timeit df['A'][0]\n%timeit df.loc[0, 'A']<\/code><\/pre>\n\n\n\nYou should see that the .loc indexer is significantly faster on a large dataset.<\/p>\n\n\n\n
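For a single scalar lookup, pandas also provides the .at<\/code> accessor, which is specialized for exactly that case and skips the more general slice-and-list machinery that .loc<\/code> carries. A quick sketch:<\/p>\n\n\n\n

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# .at performs a single scalar lookup by label, skipping the
# slice/list handling that .loc supports.
value = df.at[0, 'A']
print(value)  # 1
```

Timing df.at[0, 'A']<\/code> alongside the two indexers above with %timeit<\/code> typically shows .at<\/code> as the fastest of the three for scalar access.<\/p>\n\n\n\n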
The numexpr library can improve Pandas performance.<\/h2>\n\n\n\n
The numexpr<\/code> library is a fast numerical expression evaluator for NumPy arrays. It can significantly speed up element-wise operations on large arrays by compiling whole expressions for its multi-threaded virtual machine.<\/p>\n\n\n\nPandas uses numexpr<\/code> automatically for large arithmetic and comparison operations when the pd.options.compute.use_numexpr<\/code> option is True<\/code> (the default whenever the library is installed).<\/p>\n\n\n\nHere’s an example of how to use the numexpr<\/code> library to improve the performance of a Pandas operation using synthetic data:<\/p>\n\n\n\nimport pandas as pd\nimport numpy as np\n\n# Make sure the numexpr backend is enabled\npd.options.compute.use_numexpr = True\n\n# Create a large DataFrame with synthetic data\ndf = pd.DataFrame(np.random.randn(100_000, 100))\n\n# Element-wise arithmetic like this is routed through numexpr\n%timeit df * 2 + df<\/code><\/pre>\n\n\n\nWithout the numexpr<\/code> library, this operation would be relatively slow. However, with the numexpr<\/code> library, the operation should be significantly faster.<\/p>\n\n\n\nYou can use the %timeit<\/code> magic command to compare the performance of the operation with and without the numexpr<\/code> library. Set the pd.options.compute.use_numexpr<\/code> option to False<\/code> and run the %timeit<\/code> command again to see the difference in performance.<\/p>\n\n\n\n
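Pandas also exposes the numexpr<\/code> engine directly through pd.eval<\/code> and DataFrame.eval<\/code>, which evaluate a whole arithmetic expression in one pass instead of allocating one temporary array per operator. A minimal sketch with synthetic data:<\/p>\n\n\n\n

```python
import pandas as pd
import numpy as np

# Synthetic data: 100,000 rows, four numeric columns
df = pd.DataFrame(np.random.randn(100_000, 4), columns=list('abcd'))

# DataFrame.eval hands the whole expression to numexpr (when installed),
# avoiding one temporary array per operator.
result = df.eval('a + b * c - d')

# Plain-pandas equivalent, for comparison with %timeit
expected = df['a'] + df['b'] * df['c'] - df['d']
print(np.allclose(result, expected))  # True
```

The gain grows with the number of operators in the expression, since each operator in the plain-pandas version allocates its own intermediate array.<\/p>\n\n\n\n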
Related: <\/b>Is Your Python For-loop Slow? Use NumPy Instead<\/i><\/b><\/a><\/p>\n\n\n\nSpecify dtype explicitly to reduce memory consumption.<\/h2>\n\n\n\n
When reading in a file using Pandas, you can specify the data type of each column using the dtype<\/code> parameter. Specifying the data type of each column can significantly reduce the memory usage of the resulting DataFrame.<\/p>\n\n\n\nBy default, Pandas will infer the data type of each column based on the data it contains. However, this can result in unnecessarily large memory usage, especially if the data contains a mix of data types. For example, suppose a column contains both string and numeric data. In that case, Pandas will infer the data type to be object<\/code>, which can use significantly more memory than a numeric data type.<\/p>\n\n\n\nTo reduce the memory usage of the DataFrame, you can specify the data type of each column explicitly using the dtype<\/code> parameter. For example:<\/p>\n\n\n\nimport pandas as pd\n\n# Read in a file with the dtype parameter specified\ndf = pd.read_csv('my_file.csv', dtype={'column_1': 'float64', 'column_2': 'object'})<\/code><\/pre>\n\n\n\nIn this example, the column_1<\/code> column is specified as a float64<\/code> data type, and the column_2<\/code> column is specified as an object<\/code> data type. This can significantly reduce the memory usage of the DataFrame compared to inferring the data types automatically.<\/p>\n\n\n\nIt’s important to note that specifying the data type correctly is crucial for reducing memory usage. If you specify the wrong data type, you may use more memory than if you had let Pandas infer the data type automatically.<\/p>\n\n\n\n
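The savings are easy to measure with memory_usage(deep=True)<\/code>. For low-cardinality string columns, converting to the category<\/code> dtype is often the biggest win; a small sketch (the color values are illustrative):<\/p>\n\n\n\n

```python
import pandas as pd

# A low-cardinality string column stored as plain object dtype...
s_obj = pd.Series(['red', 'green', 'blue'] * 100_000)

# ...versus the same data converted to the category dtype,
# which stores small integer codes plus only three labels.
s_cat = s_obj.astype('category')

obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(obj_bytes > cat_bytes)  # True
```

The same trick works at read time: pass dtype={'column_2': 'category'}<\/code> to read_csv<\/code> so the large object column is never materialized at all.<\/p>\n\n\n\n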
Use the query method to filter dataframes<\/h2>\n\n\n\n
The query method in Pandas can be a faster way to filter a DataFrame<\/a> than Boolean indexing. This is because the query method<\/a> hands the whole filter expression to the numexpr<\/code> engine, which can evaluate it in a single multi-threaded pass instead of materializing intermediate Boolean arrays.<\/p>\n\n\n\nHere’s an example of how to use the query<\/code> method to filter a DataFrame:<\/p>\n\n\n\nimport pandas as pd\n\n# Create a sample DataFrame\ndf = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})\n\n# Use the query method to filter the DataFrame\nfiltered_df = df.query('A > 1')\n\nprint(filtered_df)<\/code><\/pre>\n\n\n\nThis will output the following DataFrame:<\/p>\n\n\n\n
A B\n1 2 5\n2 3 6<\/code><\/pre>\n\n\n\nTo compare the performance of the query<\/code> method with Boolean indexing, you can use the %timeit<\/code> magic command in Jupyter. Here’s an example of how to do this:<\/p>\n\n\n\n%timeit df[df['A'] > 1]\n%timeit df.query('A > 1')<\/code><\/pre>\n\n\n\nYou should see that the query method is significantly faster than Boolean indexing on a large dataset.<\/p>\n\n\n\n
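Two query<\/code> features worth knowing: local Python variables can be referenced with the @<\/code> prefix, and the engine<\/code> parameter lets you switch between the numexpr-backed and plain-Python evaluators for comparison. A short sketch (the threshold<\/code> variable is illustrative):<\/p>\n\n\n\n

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Local Python variables are referenced with the @ prefix
threshold = 1
filtered = df.query('A > @threshold')

# engine='python' bypasses numexpr -- handy for benchmarking the
# two evaluators against each other, or when numexpr is absent.
same = df.query('A > @threshold', engine='python')
print(filtered.equals(same))  # True
```

Benchmarking the two engines with %timeit<\/code> on your own data is the quickest way to see whether the numexpr path pays off for a given filter.<\/p>\n\n\n\n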
The query<\/code> method is not always the fastest way to filter a DataFrame. On small DataFrames, or for a simple condition on a single numeric or Boolean column, plain Boolean indexing is often faster because query<\/code> must parse the expression first. However, the query method is generally a good choice for fast filtering of large DataFrames.<\/p>\n\n\n\nRelated: <\/b>How to Serve Massive Computations Using Python Web Apps.<\/i><\/b><\/a><\/p>\n\n\n\nUse Numba to run Pandas operations faster.<\/h2>\n\n\n\n
The numba library<\/a> is a just-in-time compiler that can significantly speed up the execution of certain operations, especially loops over large arrays. numba<\/code> compiles plain Python functions that operate on NumPy arrays, so the usual pattern is to pass a DataFrame’s underlying arrays into a function decorated with numba.jit<\/code>.<\/p>\n\n\n\nHere’s an example of how to use numba<\/code> to speed up a Pandas operation:<\/p>\n\n\n\nimport pandas as pd\nimport numba\n\n# Create a sample DataFrame\ndf = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})\n\n# Define a compiled function that works on NumPy arrays\n@numba.jit(nopython=True)\ndef add_arrays(a, b):\n    return a + b\n\n# Pass the columns' underlying arrays to the compiled function\nresult = add_arrays(df['A'].to_numpy(), df['B'].to_numpy())\n\nprint(result)<\/code><\/pre>\n\n\n\nThis will output the following array:<\/p>\n\n\n\n[5 7 9]<\/code><\/pre>\n\n\n\nLet’s compare the performance of the compiled function with the plain Pandas operation using %timeit<\/code>. Here’s an example of how to do this:<\/p>\n\n\n\n%timeit df['A'] + df['B']\n%timeit add_arrays(df['A'].to_numpy(), df['B'].to_numpy())<\/code><\/pre>\n\n\n\nFor a simple element-wise addition, Pandas is already vectorized, so the gain here may be small.<\/p>\n\n\n\n
Note that numba<\/code> may not always speed up Pandas operations; in some cases, it may even slow them down. Therefore, always measure the performance of your decorated functions to confirm that they are actually faster.<\/p>\n\n\n\n
Use Swifter to Parallelize tasks.<\/h2>\n\n\n\n
The swifter library<\/a> is a Pandas extension that can parallelize certain operations, such as apply<\/code>, using multiple cores. To use swifter<\/code>, you can call the .swifter.apply<\/code> method on a Pandas DataFrame or Series instead of the .apply<\/code><\/a> method.<\/p>\n\n\n\nHere’s an example of how to use swifter<\/code> to parallelize the apply<\/code> operation:<\/p>\n\n\n\nimport pandas as pd\nimport swifter\n\n# Create a sample DataFrame\ndf = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})\n\n# Use the swifter.apply method to parallelize the row-wise apply\n# operation (axis=1 is required so each call receives a row)\nresult = df.swifter.apply(lambda x: x['A'] + x['B'], axis=1)\n\nprint(result)<\/code><\/pre>\n\n\n\nThis will output the following Series:<\/p>\n\n\n\n0 5\n1 7\n2 9\ndtype: int64<\/code><\/pre>\n\n\n\nTo compare the performance of the swifter.apply<\/code> method with the regular apply<\/code> method, you can use the %timeit<\/code> magic command in Jupyter. Here’s an example of how to do this:<\/p>\n\n\n\n%timeit df.apply(lambda x: x['A'] + x['B'], axis=1)\n%timeit df.swifter.apply(lambda x: x['A'] + x['B'], axis=1)<\/code><\/pre>\n\n\n\nOn a large dataset, you should see that the swifter.apply<\/code> method is significantly faster than the regular apply<\/code> method.<\/p>\n\n\n\nLike Numba, swifter<\/code> also may not always speed up Pandas operations. Again, you should test the performance of your swifter<\/code> operations to ensure that they are actually faster.<\/p>\n\n\n\nFinal thoughts<\/h2>\n\n\n\n
Big datasets are unavoidable today. Everything generates data, from your watch to your local weather station.<\/p>\n\n\n\n
When analyzing large datasets in Pandas, you’ll sooner or later run into performance issues. The Pandas library is already well optimized, with excellent built-in vectorized operations implemented in C.<\/p>\n\n\n\n
But this post has a few tips to improve its performance further. With these techniques, you can achieve better results faster.<\/p>\n","protected":false},"excerpt":{"rendered":"
How to speed up Pandas data wrangling operations with simple tweaks.<\/p>\n","protected":false},"author":2,"featured_media":641,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_kad_blocks_custom_css":"","_kad_blocks_head_custom_js":"","_kad_blocks_body_custom_js":"","_kad_blocks_footer_custom_js":"","_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"footnotes":""},"categories":[7],"tags":[],"taxonomy_info":{"category":[{"value":7,"label":"Data Wrangling"}]},"featured_image_src_large":["https:\/\/www.the-analytics.club\/wp-content\/uploads\/2023\/06\/12838-1024x638.jpg",1024,638,true],"author_info":{"display_name":"Thuwarakesh","author_link":"https:\/\/www.the-analytics.club\/author\/thuwarakesh\/"},"comment_info":0,"category_info":[{"term_id":7,"name":"Data Wrangling","slug":"data-wrangling","term_group":0,"term_taxonomy_id":7,"taxonomy":"category","description":"","parent":4,"count":4,"filter":"raw","cat_ID":7,"category_count":4,"category_description":"","cat_name":"Data 
Wrangling","category_nicename":"data-wrangling","category_parent":4}],"tag_info":false,"_links":{"self":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts\/371"}],"collection":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/comments?post=371"}],"version-history":[{"count":2,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts\/371\/revisions"}],"predecessor-version":[{"id":1318,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts\/371\/revisions\/1318"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/media\/641"}],"wp:attachment":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/media?parent=371"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/categories?post=371"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/tags?post=371"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}