Use agg() Method Over apply() To Aggregate Pandas Dataframes Faster.
Data scientists extensively use Pandas for data wrangling. Aggregation is a common task in data wrangling. But in Pandas, there is more than one way to do this.
One way is to use the apply method (or map if it’s a series) on a dataframe. This is what I usually do. But the .agg method is another option I often neglect.
For a one-time analysis, this isn’t a big deal. The difference is minuscule for smaller datasets.
Yet, when we work on a larger dataset, we must know the impact, especially if the operation is supposed to run repeatedly, as in a data pipeline.
This post discusses the difference and analyses its impact on an ordinary computer. I hope this will help you find the best option for your recurring data tasks.
How to use the agg method to aggregate dataframe columns
Pandas agg() method allows you to simultaneously apply multiple functions to a DataFrame or Series. It can be used in various ways, depending on the desired output.
Some common ways to use the agg() method include:
Applying a single function to all columns:
df.agg(np.mean)
Applying a single function to a specific column:
df['column_name'].agg(np.mean)
Applying different functions to different columns:
df.agg({'col1': np.mean, 'col2': np.sum, 'col3': np.std})
Applying several functions to each of several columns (the function names become the row labels of the result):
df.agg({'col1': ['mean', 'min'], 'col2': ['sum', 'max']})
Applying a custom function to a column:
def custom_function(x):
    return x.mean() - x.min()

df['col1'].agg(custom_function)
These are just a few examples of how the agg() method can be used. You can also use the groupby() method in conjunction with agg() to apply functions to groups of rows within the DataFrame.
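As a rough illustration, the patterns above can be tried on a tiny DataFrame. The sample data and the name spread_from_min are made up for this sketch, and string names such as 'mean' are used in place of np.mean, which is an assumption of the sketch rather than the calls shown earlier.

```python
import pandas as pd

# Tiny, made-up dataset so the results are easy to check by hand
df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [10, 20, 30, 40]})

# One aggregation over every column -> Series indexed by column name
means = df.agg("mean")

# A list of aggregations per column -> DataFrame indexed by aggregation name;
# combinations that were not requested are filled with NaN
summary = df.agg({"col1": ["mean", "min"], "col2": ["sum", "max"]})

# A custom reducer on a single column -> one scalar
def spread_from_min(x):
    return x.mean() - x.min()

spread = df["col1"].agg(spread_from_min)

print(means)
print(summary)
print(spread)
```

Note that the dictionary-of-lists form returns a table with one row per function name, so cells for functions you did not request on a column come back as NaN.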
Using the agg method with group by
The groupby() function in Pandas allows you to group a DataFrame or Series by one or more columns and apply a function to each group. You can use the agg() method in combination with the groupby() function to apply multiple functions to the groups.
To illustrate this, let’s consider the following synthetic dataset:
import numpy as np
import pandas as pd
# Create a synthetic dataset with 10 million rows and 3 columns
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 100, size=(10000000, 3)), columns=['col1', 'col2', 'Category'])
Suppose we want to group the data by the Category column and compute the mean and sum of each column for each group. We can do this using the groupby() and agg() functions as follows:
df.groupby('Category').agg({'col1': ['mean', 'sum'], 'col2': ['mean', 'sum']})
This would return a new DataFrame with one row for each unique category in the Category column, two columns for the mean and sum of col1, and two columns for the mean and sum of col2.
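To see the shape of that result, and one way to get flat, renamed output columns, here is a minimal sketch on made-up data. It uses pandas' named aggregation syntax; the output names col1_mean and col2_sum are chosen only for illustration.

```python
import pandas as pd

# Small, made-up dataset with two categories
df = pd.DataFrame({
    "Category": ["a", "a", "b", "b"],
    "col1": [1, 3, 5, 7],
    "col2": [10, 20, 30, 40],
})

# Dictionary of lists -> columns are a two-level index such as ('col1', 'mean')
nested = df.groupby("Category").agg(
    {"col1": ["mean", "sum"], "col2": ["mean", "sum"]}
)

# Named aggregation -> flat column names that you choose yourself
flat = df.groupby("Category").agg(
    col1_mean=("col1", "mean"),
    col2_sum=("col2", "sum"),
)

print(nested)
print(flat)
```

The named form is handy when the result feeds another step of a pipeline, because the column names stay simple strings.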
You can also use the groupby() function with the apply() method to apply a custom function to each group. For example:
def custom_function(group):
    return group.mean() - group.min()

df.groupby('Category').apply(custom_function)
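The same custom reducer can also be handed to agg() on the grouped object. The sketch below uses made-up data and selects the value columns explicitly, so the grouping column itself is not passed to the function; that selection is an assumption of this example.

```python
import numpy as np
import pandas as pd

# Small, made-up dataset with two categories
df = pd.DataFrame({
    "Category": [0, 0, 1, 1],
    "col1": [1, 3, 5, 9],
    "col2": [10, 20, 30, 50],
})

def custom_function(group):
    # Works on a whole group (DataFrame) or on a single column (Series)
    return group.mean() - group.min()

grouped = df.groupby("Category")[["col1", "col2"]]

# apply() passes each group's sub-DataFrame to the function
via_apply = grouped.apply(custom_function)

# agg() calls the function once per column within each group
via_agg = grouped.agg(custom_function)

print(via_apply)
print(via_agg)
```

Both calls give one row per category here, since the function reduces every column to a single number.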
Benefits of using agg() over apply()
The agg() method and the apply() method in Pandas both apply functions to a DataFrame or Series. However, there are a few key differences between the two approaches:
- The agg() method is specifically designed for applying multiple functions to a DataFrame or Series at once, whereas the apply() method is more flexible and can apply a single built-in or user-defined function to a DataFrame or Series.
- The agg() method is generally faster than the apply() method, because built-in aggregations passed by name (such as 'mean' or 'sum') can use Pandas' optimized implementations rather than calling a Python function on each column or group. This can be especially important when working with large datasets.
- The agg() method has a more concise syntax, as you can specify multiple functions in a single line of code. This can make your code easier to read and maintain.
Overall, the agg() method is generally a better choice if you want to apply multiple functions to a DataFrame or Series, and performance is a concern. The apply() method is more flexible and is useful when you need custom logic that cannot be expressed with the built-in aggregation functions provided by agg().
When to use apply() over .agg() for aggregation?
As mentioned earlier, the agg() method is specifically designed for applying multiple functions to a DataFrame or Series at once, whereas the apply() method is more flexible and can be used to apply a single function or a user-defined function to a DataFrame or Series.
If you only want to apply a single function to a DataFrame or Series, and that function is not one of the built-in aggregation functions provided by agg(), then you can use the apply() method.
For example, suppose you have a DataFrame with a column of strings and want to apply a custom function that counts the number of vowels in each string. You could do this using the apply() method:
def count_vowels(x):
    vowels = 'aeiouAEIOU'
    return sum(c in vowels for c in x)

df['string_column'].apply(count_vowels)
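In this case, agg() is not the natural choice. The count_vowels function works element-wise, returning one value per row rather than reducing the column to a single summary, and agg() has no built-in function for counting vowels.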
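Here is a small runnable version of the vowel example. The three sample strings are invented for the sketch; the point is that apply() returns one count per row, which can then be reduced in a separate step if a single total is needed.

```python
import pandas as pd

def count_vowels(x):
    vowels = "aeiouAEIOU"
    # Booleans sum as 0/1, so this counts the vowel characters in x
    return sum(c in vowels for c in x)

# Made-up data: one string per row
words = pd.DataFrame({"string_column": ["pandas", "aggregate", "sky"]})

# Element-wise: the result has the same length as the input column
counts = words["string_column"].apply(count_vowels)

# Reducing the per-row counts to one number is a separate aggregation step
total = counts.sum()

print(counts.tolist())
print(total)
```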
On the other hand, if you want to apply multiple functions to a DataFrame or Series, then the agg() method is generally a better choice because it is more concise and efficient.
For example, suppose you want to compute each column’s mean, sum, and standard deviation in a DataFrame. You could do this using the agg() method:
df.agg({'col1': ['mean', 'sum', 'std'], 'col2': ['mean', 'sum', 'std']})
In this case, the apply() method would be more cumbersome, as you would need to define a custom function that computes all three statistics and apply it to each column separately.
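To make the comparison concrete, here is a hedged sketch of what the apply() route might look like next to agg(). The helper three_stats and the sample data are invented for illustration.

```python
import numpy as np
import pandas as pd

# Tiny, made-up dataset
df = pd.DataFrame({"col1": [1.0, 2.0, 3.0], "col2": [4.0, 6.0, 8.0]})

def three_stats(column):
    # Hypothetical helper: bundle three statistics for one column
    return pd.Series(
        {"mean": column.mean(), "sum": column.sum(), "std": column.std()}
    )

# apply() needs the helper to produce all three numbers per column
via_apply = df.apply(three_stats)

# agg() expresses the same request as a list of names
via_agg = df.agg(["mean", "sum", "std"])

print(via_apply)
print(via_agg)
```

Both results have one row per statistic and one column per input column; the agg() call simply needs no helper function.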
Built-in functions provided by .agg()
The agg() method in Pandas provides a number of built-in functions that you can use to aggregate data. These functions include:
- 'sum': computes the sum of the values in a column.
- 'mean': computes the mean of the values in a column.
- 'count': counts the number of non-NA/null values in a column.
- 'min': computes the minimum value in a column.
- 'max': computes the maximum value in a column.
- 'median': computes the median of the values in a column.
- 'std': computes the standard deviation of the values in a column.
- 'var': computes the variance of the values in a column.
- 'sem': computes the standard error of the mean of the values in a column.
- 'first': returns the first non-NA/null value in a column (typically used with groupby()).
- 'last': returns the last non-NA/null value in a column (typically used with groupby()).
Here is an example of how you can use these built-in functions with the agg() method:
df.agg({'col1': ['sum', 'mean'], 'col2': ['min', 'max']})
This would compute the sum and mean of col1 and the minimum and maximum of col2.
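The note about non-NA/null values is easy to miss, so here is a minimal sketch of how 'count', 'first', and 'last' treat missing values within groups. The column names g and v and the data are made up for this example.

```python
import numpy as np
import pandas as pd

# Made-up data: each group contains one missing value
df = pd.DataFrame({
    "g": ["a", "a", "b", "b"],
    "v": [np.nan, 2.0, 3.0, np.nan],
})

# Several built-in aggregations at once, one row per group
out = df.groupby("g")["v"].agg(["count", "first", "last", "mean"])

print(out)
```

In group a the first row is missing, so 'first' skips it and returns 2.0, and 'count' reports one value rather than two.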
Performance improvement between .apply() and .agg() using synthetic dataset.
The apply() method and the agg() method in Pandas are useful for applying functions to a DataFrame or Series. However, the agg() method is generally faster than the apply() method because it uses a more efficient implementation under the hood.
To illustrate the performance difference between the two methods, let’s consider the following synthetic dataset:
import numpy as np
import pandas as pd
# Create a synthetic dataset with 10 million rows and 2 columns
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 100, size=(10000000, 2)), columns=['col1', 'col2'])
Suppose we want to apply the mean() function to both columns of this dataset. We can do this using the apply() method. We’ll also use the %time magic command, which is available in IPython and Jupyter, to measure the performance.
%time df.apply(np.mean)
On my machine, this takes about 2.28 seconds to run.
Now let’s try the same thing using the agg() method:
%time df.agg(np.mean)
This takes about 0.68 seconds to run on my machine, which is significantly faster than the apply() method.
This performance difference can be even more pronounced for more complex custom functions or larger datasets. Therefore, if performance is a concern and you want to apply multiple functions to a DataFrame or Series, it is generally a good idea to use the agg() method rather than the apply() method.
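If you want to repeat the comparison outside a notebook, where %time is not available, a plain script along these lines is one option. This is only a sketch under a few assumptions: it uses 1 million rows instead of 10 million to keep the run short, a lambda and the string 'mean' instead of np.mean, and a made-up helper called timed. The timings it prints will depend on your machine and Pandas version.

```python
import time

import numpy as np
import pandas as pd

# Smaller synthetic dataset than the one above (1 million rows, 2 columns)
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(0, 100, size=(1_000_000, 2)), columns=["col1", "col2"]
)

def timed(label, func):
    # Hypothetical helper: run func once and print the elapsed wall-clock time
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f} s")
    return result

# apply() with a Python-level function called once per column
apply_result = timed("apply", lambda: df.apply(lambda s: s.mean()))

# agg() with the built-in aggregation referenced by name
agg_result = timed("agg", lambda: df.agg("mean"))
```

A single run is noisy, so for a decision that matters it is worth repeating each measurement several times and comparing the typical values.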