5 Pandas Performance Optimization Tips Without Crazy Setups.


Pandas is the de-facto data-wrangling library in Python, with excellent capabilities for slicing and dicing large DataFrames. Try the same operation in Excel and in Pandas and you'll quickly see the difference.

Most Pandas operations are pre-optimized with native C implementations under the hood, which makes vectorized operations nearly effortless. Even complex algorithms often run faster in Pandas than in hand-written Python loops.

Yet that's not the best Pandas can offer. Analysts should be aware of a few little tweaks to get the most out of it. This post is all about those tips.

6 Ways to Improve Pandas Performance

Here are six ways to improve the performance of Pandas:

  1. Use the .loc indexer instead of chained [] indexing. The .loc indexer reaches an element in a single lookup instead of first materializing an intermediate Series.
  2. Use the numexpr library for operations on large arrays. numexpr evaluates whole expressions in a chunked, multithreaded engine, which can significantly speed up element-wise arithmetic on large arrays.
  3. Use the dtype parameter when reading in a file. Specifying the data type of each column can significantly reduce the memory usage of the DataFrame.
  4. Use the query method for fast filtering. On large DataFrames, query can outperform Boolean indexing because it evaluates the whole filter expression at once through numexpr.
  5. Use the numba library to speed up certain operations. numba is a just-in-time compiler that can significantly speed up numerical code, especially loops over large arrays.
  6. Use the swifter library to parallelize certain operations. swifter is a Pandas extension that can parallelize operations such as apply across multiple cores.

Related: How to Speed up Python Data Pipelines up to 91X?

Use .loc over [] indexers

Accessing a scalar with chained [] indexing, such as df['A'][0], first materializes the column as an intermediate Series and then indexes into it. The .loc indexer reaches the element in a single lookup, avoiding that intermediate step (and the SettingWithCopy pitfalls that chained indexing invites).

For example, consider the following DataFrame:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

To access a single element of the DataFrame using the [] indexer, you would use the following syntax:

df['A'][0]

This syntax first materializes column A as an intermediate Series and then accesses the first element of that Series.

On the other hand, to access the same element using the .loc indexer, you would use the following syntax:

df.loc[0, 'A']

This syntax looks up the element in a single step, with no intermediate object.

In general, the .loc indexer is faster because it avoids creating an intermediate Series on every access. The saving is tiny for one lookup but adds up quickly when you access elements inside a loop.

Here’s an example of how you can use the %timeit magic command in Jupyter to compare the performance of the [] and .loc indexers:

%timeit df['A'][0]
%timeit df.loc[0, 'A']

On a tiny three-row frame the difference is modest, but the .loc lookup should still come out ahead, and the gap matters once the access is repeated many times.
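Outside a notebook, the same comparison can be scripted with the standard timeit module. Here is a minimal sketch on a larger synthetic frame (the sizes and column names are arbitrary choices for illustration):

```python
import timeit

import numpy as np
import pandas as pd

# A larger synthetic DataFrame so the per-lookup overhead is measurable
df = pd.DataFrame(np.random.randn(100_000, 4), columns=list('ABCD'))

# Both expressions fetch the same scalar
assert df['A'][0] == df.loc[0, 'A']

chained = timeit.timeit(lambda: df['A'][0], number=10_000)
direct = timeit.timeit(lambda: df.loc[0, 'A'], number=10_000)

print(f"chained []: {chained:.4f}s  .loc: {direct:.4f}s")
```

For single scalar lookups in hot loops, the even faster .at accessor is worth knowing about as well.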

 

The numexpr library can improve Pandas performance.

The numexpr library is a fast numerical expression evaluator for NumPy arrays. It can significantly speed up element-wise operations on large arrays by evaluating whole expressions in a chunked, multithreaded engine instead of materializing every intermediate result.

To use numexpr with Pandas, make sure the compute.use_numexpr option is set to True (it defaults to True whenever numexpr is installed). Pandas will then route large arithmetic expressions, as well as eval and query, through numexpr.

Here’s an example of how to use the numexpr library to improve the performance of a Pandas operation using synthetic data:

import pandas as pd
import numpy as np
import numexpr

# Enable numexpr-backed evaluation (the default when numexpr is installed)
pd.options.compute.use_numexpr = True

# Create a large DataFrame with synthetic data
df = pd.DataFrame(np.random.randn(1_000_000, 100))

# Time an element-wise expression that Pandas can hand off to numexpr
%timeit df * 2 + df

Without numexpr, Pandas computes each intermediate result as a separate NumPy array. With numexpr enabled, the expression is evaluated in one chunked, multithreaded pass, which should be noticeably faster on a frame this size.

You can use the %timeit magic command to compare the performance of the operation with and without the numexpr library. Set the pd.options.compute.use_numexpr option to False and run the %timeit command again to see the difference in performance.
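The same machinery is exposed directly through DataFrame.eval, which hands a whole expression string to numexpr when it is installed and falls back to the regular Python engine otherwise, so the result is identical either way. A minimal sketch with arbitrary sizes and column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(200_000, 4), columns=list('ABCD'))

# eval parses the string and evaluates the whole expression in one pass
fast = df.eval('A + 2 * B - C')

# The equivalent step-by-step arithmetic, with intermediate arrays
plain = df['A'] + 2 * df['B'] - df['C']

assert np.allclose(fast, plain)
```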

 

Related: Is Your Python For-loop Slow? Use NumPy Instead

Specify dtype explicitly to reduce memory consumption.

When reading in a file using Pandas, you can specify the data type of each column using the dtype parameter. Specifying the data type of each column can significantly reduce the memory usage of the resulting DataFrame.

By default, Pandas will infer the data type of each column based on the data it contains. However, this can result in unnecessarily large memory usage, especially if the data contains a mix of data types. For example, suppose a column contains both string and numeric data. In that case, Pandas will infer the data type to be object, which can use significantly more memory than a numeric data type.

To reduce the memory usage of the DataFrame, you can specify the data type of each column explicitly using the dtype parameter. For example:

import pandas as pd

# Read in a file with the dtype parameter specified
df = pd.read_csv('my_file.csv', dtype={'column_1': 'float64', 'column_2': 'object'})

In this example, the column_1 column is specified as a float64 data type, and the column_2 column is specified as an object data type. This can significantly reduce the memory usage of the DataFrame compared to inferring the data types automatically.

It's important to note that the data types must be chosen carefully. Declare a type that is too narrow for the data and read_csv will raise an error or lose precision; declare one that is too wide (for example, object for numeric data) and you may use even more memory than automatic inference would.
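To see the effect without a file on disk, you can feed read_csv an in-memory buffer and compare memory_usage with and without explicit dtypes. The column names and values below are made up for illustration:

```python
import io

import pandas as pd

# Synthetic CSV contents: a very repetitive string column and a small integer
csv_text = "city,year\n" + "london,2001\n" * 10_000

# Default inference: 'city' becomes object, 'year' becomes int64
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit dtypes: category for the repetitive strings, int16 for the year
explicit = pd.read_csv(io.StringIO(csv_text),
                       dtype={'city': 'category', 'year': 'int16'})

print(inferred.memory_usage(deep=True).sum())   # much larger
print(explicit.memory_usage(deep=True).sum())
```

The category dtype is especially effective here because each distinct string is stored once and each row holds only a small integer code.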

Use the query method to filter DataFrames

The query method in Pandas can filter a large DataFrame faster than Boolean indexing. This is because query evaluates the whole filter expression in one pass through the numexpr engine, whereas Boolean indexing builds each intermediate Boolean array through Python-level operator calls.

Here’s an example of how to use the query method to filter a DataFrame:

import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Use the query method to filter the DataFrame
filtered_df = df.query('A > 1')

print(filtered_df)

This will output the following DataFrame:

   A  B
1  2  5
2  3  6

To compare the performance of the query method with Boolean indexing, you can use the %timeit magic command in Jupyter. Here’s an example of how to do this:

%timeit df[df['A'] > 1]
%timeit df.query('A > 1')

On this three-row frame, query will actually lose: parsing the expression adds fixed overhead. Rerun the comparison on a DataFrame with a few million rows to see query pull ahead.

The query method is not always the fastest way to filter a DataFrame. For a simple condition on a single column, especially on a small frame, plain Boolean indexing can be quicker. For large DataFrames with compound conditions, however, query is generally a good choice.
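Whichever form you pick, the two are interchangeable: query parses its string into the same filter that the Boolean mask expresses. A small sketch on synthetic data confirming they agree:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'A': rng.integers(0, 100, 500_000),
                   'B': rng.standard_normal(500_000)})

# The same compound filter, written both ways
mask_result = df[(df['A'] > 50) & (df['B'] > 0)]
query_result = df.query('A > 50 and B > 0')

assert mask_result.equals(query_result)
print(len(query_result), "rows matched")
```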

Related: How to Serve Massive Computations Using Python Web Apps.

Use Numba to run Pandas operations faster.

The numba library is a just-in-time compiler that can significantly speed up numerical code, especially loops over large arrays. numba compiles functions that operate on NumPy arrays; it cannot compile Pandas objects themselves, so the usual pattern is to extract the underlying arrays with .to_numpy() and pass those to a function decorated with numba.jit (or its nopython shorthand, numba.njit).

Here’s an example of how to use numba to speed up a Pandas operation:

import pandas as pd
import numba

# Create a sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Define a compiled function that operates on NumPy arrays
@numba.njit
def add_arrays(a, b):
    return a + b

# Pass the underlying arrays, then wrap the result back into a Series
result = pd.Series(add_arrays(df['A'].to_numpy(), df['B'].to_numpy()))

print(result)

This will output the following Series:

0    5
1    7
2    9
dtype: int64

Let's compare the performance of the compiled function with plain vectorized arithmetic using %timeit. Here's an example of how to do this:

%timeit df['A'].to_numpy() + df['B'].to_numpy()
%timeit add_arrays(df['A'].to_numpy(), df['B'].to_numpy())

For a simple addition like this, NumPy is already close to optimal, so don't expect numba to win. The payoff comes when the function contains explicit loops or branching that NumPy cannot express as a single vectorized operation.

It’s important to note that numba may not always speed up Pandas operations; in some cases, it may even slow them down. Therefore, it’s important to test the performance of your decorated functions to ensure that they are actually faster.

 

Use Swifter to parallelize tasks

The swifter library is a Pandas extension that can parallelize certain operations, such as apply, using multiple cores. To use swifter, you can call the .swifter.apply method on a Pandas DataFrame or Series instead of the .apply method.

Here’s an example of how to use swifter to parallelize the apply operation:

import pandas as pd
import swifter

# Create a sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Use the swifter.apply method to parallelize a row-wise apply
result = df.swifter.apply(lambda row: row['A'] + row['B'], axis=1)

print(result)

This will output the following Series:

0    5
1    7
2    9
dtype: int64

To compare the performance of the swifter.apply method with the regular apply method, you can use the %timeit magic command in Jupyter. Here's an example of how to do this:

%timeit df.apply(lambda row: row['A'] + row['B'], axis=1)
%timeit df.swifter.apply(lambda row: row['A'] + row['B'], axis=1)

On a large dataset, the swifter.apply method can be significantly faster than the regular apply method. On a small frame like this one, though, the overhead of setting up parallel workers will dominate.

Like numba, swifter may not always speed up Pandas operations. Again, test your swifter calls to make sure they are actually faster.
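One more caveat before reaching for parallelism: a row-wise apply that can be rewritten as column arithmetic will beat both apply and swifter.apply by a wide margin. A quick check using only pandas, with arbitrary example data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(100_000), 'B': np.arange(100_000)})

# Row-wise apply: a Python-level function call per row
applied = df.apply(lambda row: row['A'] + row['B'], axis=1)

# Vectorized column arithmetic: one C-level operation
vectorized = df['A'] + df['B']

assert applied.equals(vectorized)
```

Parallelizing an apply only divides its Python-level cost across cores; vectorizing removes that cost entirely.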

Final thoughts

Big datasets are unavoidable today. Everything generates data, from your watch to your local weather station.

When analyzing large datasets in Pandas, you will sooner or later run into performance issues, even though the library is already well optimized, with built-in vectorized operations backed by C implementations.

But this post has a few tips to improve its performance further. With these techniques, you can achieve better results faster.
