5 Pandas Performance Optimization Tips Without Crazy Setups.
Pandas is the de-facto data wrangling library in Python. It has excellent capabilities to slice and dice large dataframes. You’d quickly see the difference if you tried to do the same operation in Excel and Pandas.
Most of its operations are pre-optimized, with native C implementations under the hood, and it makes vectorized operations effortless. Thus, even complex algorithms tend to run well in Pandas.
Yet the default behavior is not the best Pandas can offer. A few small tweaks help analysts get the most out of it, and this post is all about them.
5 Ways to Improve Pandas Performance
Here are the techniques this post covers:
- Use the .loc indexer instead of chained [] indexing. The .loc indexer resolves the row and the column in a single call instead of first building an intermediate Series.
- Use the numexpr library for operations on large arrays. The numexpr library can significantly speed up element-wise expressions on large arrays with its multi-threaded evaluator written in C.
- Use the dtype parameter when reading in a file. Specifying the data type of each column can significantly reduce the memory usage of the DataFrame.
- Use the query method for filtering. On large DataFrames, the query method can be faster than Boolean indexing because it can evaluate the expression with numexpr.
- Use the numba library to speed up certain operations. The numba library is a just-in-time compiler that can significantly speed up the execution of certain operations, especially loops over large arrays.
- Use the swifter library for parallelizing certain operations. The swifter library is a Pandas extension that can parallelize certain operations, such as apply, using multiple cores.
Use .loc over [] indexers
The .loc indexer can be faster than chained [] indexing because it resolves the row and the column in a single call, whereas chained indexing first builds an intermediate object and then indexes into it.
For example, consider the following DataFrame:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
To access a single element of the DataFrame using the [] indexer, you would use the following syntax:
df['A'][0]
This syntax first selects column A as a new Series object and then looks up label 0 in that Series, so it takes two separate indexing steps.
On the other hand, to access the same element using the .loc indexer, you would use the following syntax:
df.loc[0, 'A']
This syntax resolves the row label and the column label in one call, without building an intermediate Series.
In general, the .loc indexer avoids the overhead of that intermediate object, and it is also unambiguous when you assign values. How much time it saves depends on the Pandas version and on the shape of the data, so it is worth measuring.
Here’s an example of how you can use the %timeit magic command in Jupyter to compare the performance of the [] and .loc indexers:
%timeit df['A'][0]
%timeit df.loc[0, 'A']
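Run both lines on a DataFrame of realistic size. The difference for a single lookup is usually measured in microseconds, so it matters mostly when the access is repeated many times.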
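Outside Jupyter, the same comparison can be sketched with the standard timeit module. This is a minimal sketch, assuming a synthetic DataFrame of 100,000 rows; the sizes, the seed, and the variable names are illustrative and not part of the original example.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: the row count and seed are arbitrary choices for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": rng.integers(0, 100, 100_000),
    "B": rng.integers(0, 100, 100_000),
})

# Both expressions read the same cell: row label 0, column "A".
chained = df["A"][0]
direct = df.loc[0, "A"]

# Time each access pattern over the same number of repetitions.
t_chained = timeit.timeit(lambda: df["A"][0], number=2_000)
t_loc = timeit.timeit(lambda: df.loc[0, "A"], number=2_000)

print(f"chained []: {t_chained:.4f}s  .loc: {t_loc:.4f}s")
```

The timings vary with the Pandas version and the machine, so treat the printed numbers as a measurement for your own setup rather than a fixed ranking.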
The numexpr library can improve Pandas performance.
The numexpr library is a fast numerical expression evaluator for NumPy arrays. It can significantly speed up element-wise operations on large arrays by compiling an expression for its own multi-threaded virtual machine and evaluating it in chunks, which avoids large temporary arrays.
Pandas controls its use of numexpr through the compute.use_numexpr option. When numexpr is installed, the option is on by default, and Pandas uses the library for certain element-wise arithmetic and comparison operations on large DataFrames.
Here’s an example of how to use the numexpr library to improve the performance of a Pandas operation using synthetic data:
import pandas as pd
import numpy as np
import numexpr
# Enable numexpr-backed evaluation (already the default when numexpr is installed)
pd.options.compute.use_numexpr = True
# Create a large DataFrame with synthetic data (about 80 MB of float64 values)
df = pd.DataFrame(np.random.randn(1000000, 10))
# Time an element-wise expression, the kind of operation numexpr can accelerate
%timeit df * 2 + df
Without the numexpr library, an operation like this is evaluated by plain NumPy. With the numexpr library, it can be noticeably faster on large DataFrames, although the gain depends on the data size and the number of cores.
You can use the %timeit magic command to compare the performance of the operation with and without the numexpr library. Set the pd.options.compute.use_numexpr option to False and run the %timeit command again to see the difference in performance.
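The on/off comparison described above can be sketched as a plain script. This is a minimal sketch, assuming a synthetic DataFrame of 200,000 rows by 10 columns; time_expression is a hypothetical helper name, and the option is toggled with pd.option_context so that it is restored afterwards. If numexpr is not installed, both runs simply use NumPy.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: the shape and seed are arbitrary choices for illustration.
df = pd.DataFrame(np.random.default_rng(0).standard_normal((200_000, 10)))


def time_expression(use_numexpr, repeats=5):
    """Time an element-wise expression with the numexpr option on or off."""
    with pd.option_context("compute.use_numexpr", use_numexpr):
        return timeit.timeit(lambda: df * 2 + df, number=repeats)


# The result should not depend on the evaluation engine.
with pd.option_context("compute.use_numexpr", True):
    with_numexpr = df * 2 + df
with pd.option_context("compute.use_numexpr", False):
    without_numexpr = df * 2 + df

print(f"numexpr on:  {time_expression(True):.4f}s")
print(f"numexpr off: {time_expression(False):.4f}s")
```

Using a context manager rather than assigning to pd.options keeps the setting from leaking into the rest of the session.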
Specify dtype explicitly to reduce memory consumption.
When reading in a file using Pandas, you can specify the data type of each column using the dtype parameter. Specifying the data type of each column can significantly reduce the memory usage of the resulting DataFrame.
By default, Pandas will infer the data type of each column based on the data it contains. However, this can result in unnecessarily large memory usage. Numeric columns are inferred as 64-bit types even when a smaller type would hold the values, and a column that contains both string and numeric data is inferred as object, which can use significantly more memory than a numeric data type.
To reduce the memory usage of the DataFrame, you can specify the data type of each column explicitly using the dtype parameter. For example:
import pandas as pd
# Read in a file with the dtype parameter specified
df = pd.read_csv('my_file.csv', dtype={'column_1': 'float32', 'column_2': 'category'})
In this example, column_1 is read as float32 instead of the default float64, which halves its size at the cost of some precision, and column_2 is read as category instead of object, which saves memory when the column holds a small set of repeated strings. Choices like these can significantly reduce the memory usage of the DataFrame compared to inferring the data types automatically.
It’s important to note that specifying the data type correctly is crucial for reducing memory usage. If you specify the wrong data type, you may use more memory than if you had let Pandas infer the data type automatically.
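To see the effect of the dtype parameter, you can compare memory_usage(deep=True) for the same file read both ways. This is a minimal sketch, assuming a small synthetic CSV held in memory; the column names follow the example above, while the row count, the values, and the float32 and category choices are illustrative assumptions.

```python
import io

import numpy as np
import pandas as pd

# Build a synthetic CSV in memory; the contents are made up for illustration.
rng = np.random.default_rng(0)
n_rows = 10_000
sample = pd.DataFrame({
    "column_1": rng.random(n_rows).round(3),
    "column_2": rng.choice(["red", "green", "blue"], n_rows),
})
csv_text = sample.to_csv(index=False)

# Read once with inferred types and once with explicit, smaller types.
inferred = pd.read_csv(io.StringIO(csv_text))
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"column_1": "float32", "column_2": "category"},
)

# deep=True counts the memory held by string values, not just the pointers.
inferred_bytes = int(inferred.memory_usage(deep=True).sum())
typed_bytes = int(typed.memory_usage(deep=True).sum())

print(f"inferred: {inferred_bytes:,} bytes")
print(f"typed:    {typed_bytes:,} bytes")
```

The category type pays off here because the column has only three distinct values. On a column of mostly unique strings it may not save anything.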
Use the query method to filter dataframes
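The query method in Pandas can be a faster way to filter a large DataFrame than Boolean indexing. This is because query hands the expression to Pandas’ eval machinery, which uses the numexpr engine when it is installed and can evaluate a compound condition without creating every intermediate Boolean array. On small DataFrames, the cost of parsing the expression usually outweighs this benefit.
Here’s an example of how to use the query method to filter a DataFrame: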
import pandas as pd
# Create a sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# Use the query method to filter the DataFrame
filtered_df = df.query('A > 1')
print(filtered_df)
This will output the following DataFrame:
A B
1 2 5
2 3 6
To compare the performance of the query method with Boolean indexing, you can use the %timeit magic command in Jupyter. Here’s an example of how to do this:
%timeit df[df['A'] > 1]
%timeit df.query('A > 1')
On a large dataset with numexpr installed, you may see that the query method is faster than Boolean indexing, especially for conditions that combine several columns. On the three-row sample above, Boolean indexing will usually win.
The query method is not always the fastest way to filter a DataFrame. For example, if you are filtering on a single column and the data type is integer or boolean, it may be faster to use Boolean indexing. However, the query method is generally a good choice for filtering large DataFrames.
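Since the outcome depends on the data size, a comparison on a larger synthetic DataFrame is more telling than the three-row sample. This is a minimal sketch, assuming 500,000 rows of random integers and a two-column condition; the sizes, the seed, and the thresholds are illustrative.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: the row count and value ranges are arbitrary for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": rng.integers(0, 100, 500_000),
    "B": rng.integers(0, 100, 500_000),
})

# The same compound filter written both ways.
mask_result = df[(df["A"] > 50) & (df["B"] < 20)]
query_result = df.query("A > 50 and B < 20")

# Time each form over the same number of repetitions.
t_mask = timeit.timeit(lambda: df[(df["A"] > 50) & (df["B"] < 20)], number=10)
t_query = timeit.timeit(lambda: df.query("A > 50 and B < 20"), number=10)

print(f"Boolean indexing: {t_mask:.4f}s  query: {t_query:.4f}s")
```

Both forms return the same rows, so the choice can be made on speed and readability for your own data.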
Use Numba to run Pandas operations faster.
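The numba library is a just-in-time compiler that can significantly speed up the execution of certain operations, especially loops over large arrays. Numba compiles code that works on NumPy arrays and does not understand DataFrame objects, so the usual pattern is to decorate a function that takes NumPy arrays and to pass it the columns through .to_numpy().
Here’s an example of how to use numba to speed up a Pandas operation:
import numpy as np
import pandas as pd
import numba
# Create a sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# Compiled kernel that adds two NumPy arrays element by element
@numba.njit
def add_arrays(a, b):
    out = np.empty_like(a)
    for i in range(a.shape[0]):
        out[i] = a[i] + b[i]
    return out
# Define a function that performs the Pandas operation through the compiled kernel
def my_func(df):
    values = add_arrays(df['A'].to_numpy(), df['B'].to_numpy())
    return pd.Series(values, index=df.index)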
# Use the decorated function
result = my_func(df)
print(result)
This will output the following Series:
0 5
1 7
2 9
dtype: int64
Let’s compare the performance of the decorated function with the original function using %timeit. Here’s an example of how to do this:
%timeit df['A'] + df['B']
%timeit my_func(df)
For a simple addition like this one, the built-in vectorized operator is already fast, so the gain may be small or absent. Numba tends to help most with explicit loops that cannot be expressed as vectorized operations. Keep in mind that the first call also includes the compilation time.
It’s important to note that numba may not always speed up Pandas operations; in some cases, it may even slow them down. Therefore, it’s important to test the performance of your decorated functions to ensure that they are actually faster.
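Numba is most useful for loops in which each step depends on the previous one, since these are hard to vectorize. The following is a minimal sketch of such a loop. The function capped_running_sum and its reset rule are hypothetical and chosen only for illustration, and the no-op fallback decorator is an assumption that lets the script run where numba is not installed.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs (uncompiled) without numba.
    def njit(func):
        return func


@njit
def capped_running_sum(values, cap):
    """Running sum that resets to zero whenever it exceeds cap."""
    out = np.empty(values.shape[0], dtype=np.float64)
    total = 0.0
    for i in range(values.shape[0]):
        total += values[i]
        if total > cap:
            total = 0.0
        out[i] = total
    return out


df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]})

# Pass the NumPy array, not the Series or the DataFrame, to the kernel.
df["capped"] = capped_running_sum(df["A"].to_numpy(), 5.0)
print(df)
```

The first call triggers compilation, so time the function on a second call when you benchmark it.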
Use Swifter to Parallelize tasks.
The swifter library is a Pandas extension that can parallelize certain operations, such as apply, using multiple cores. To use swifter, you can call the .swifter.apply method on a Pandas DataFrame or Series instead of the .apply method.
Here’s an example of how to use swifter to parallelize the apply operation:
import pandas as pd
import swifter
# Create a sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# Use the swifter.apply method for the row-wise apply operation (axis=1 passes each row to the function)
result = df.swifter.apply(lambda x: x['A'] + x['B'], axis=1)
print(result)
This will output the following Series:
0 5
1 7
2 9
dtype: int64
To compare the performance of the swifter.apply method with the regular apply method, you can use the %timeit magic command in Jupyter. Here’s an example of how to do this:
%timeit df.apply(lambda x: x['A'] + x['B'], axis=1)
%timeit df.swifter.apply(lambda x: x['A'] + x['B'], axis=1)
On a large dataset, you may see that the swifter.apply method is significantly faster than the regular apply method. On a small one like the sample above, the extra setup work usually makes it slower.
Like Numba, swifter may not always speed up Pandas operations. Again, you should test the performance of your swifter operations to ensure that they are actually faster.
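Because swifter is an optional dependency, a script can fall back to the regular apply method when it is missing. This is a minimal sketch of that pattern; add_row, apply_rows, and the HAVE_SWIFTER flag are hypothetical names introduced for illustration.

```python
import pandas as pd

try:
    import swifter  # noqa: F401  (importing registers the .swifter accessor)
    HAVE_SWIFTER = True
except ImportError:
    HAVE_SWIFTER = False


def add_row(row):
    """Row-wise function: add the values in columns A and B."""
    return row["A"] + row["B"]


def apply_rows(frame):
    """Apply add_row to each row, through swifter when it is available."""
    if HAVE_SWIFTER:
        return frame.swifter.apply(add_row, axis=1)
    return frame.apply(add_row, axis=1)


df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
result = apply_rows(df)
print(result)
```

Keeping the row function separate from the dispatch makes it easy to time both paths on the same data.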
Final thoughts
Big datasets are unavoidable today. Everything generates data, from your watch to your local weather station.
When analyzing large datasets in Pandas, you will sooner or later run into performance issues, even though the library is already optimized for performance with built-in vectorized operations and implementations in C.
The tips in this post can improve its performance further. None of them is guaranteed to help in every case, so measure each one on your own data before adopting it.