Data scientists extensively use Pandas for data wrangling. Aggregation is a common task in data wrangling. But in Pandas, there is more than one way to do this.

One way is to use the `apply` method (or `map` if it's a series) on a dataframe. This is what I usually do. But the `.agg` method is another option I often neglect.

For a one-time analysis, this isn't a big deal. The difference is minuscule for smaller datasets.

Yet, when we work on a larger dataset, we must know the impact, especially if the operation is supposed to run repeatedly, as in a data pipeline.

This post discusses the difference and analyses its impact on an ordinary computer. I hope this will help you find the best option for your recurring data tasks.



## How to use the agg method to aggregate dataframe columns

The Pandas `agg()` method lets you apply one or more functions to a DataFrame or Series in a single call. It can be used in several ways, depending on the desired output.

Some common ways to use the `agg()` method include:

Applying a single function to all columns:

`df.agg(np.mean)`

Applying a single function to a specific column:

`df['column_name'].agg(np.mean)`

Applying different functions to different columns:

`df.agg({'col1': np.mean, 'col2': np.sum, 'col3': np.std})`

Applying several functions to different columns (the result has one row per function name):

`df.agg({'col1': ['mean', 'min'], 'col2': ['sum', 'max']})`

Applying a custom function to a column:

```
def custom_function(x):
    # Reduce the column to a single value: its mean minus its minimum
    return x.mean() - x.min()

df['col1'].agg(custom_function)
```
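To see what each of these forms returns, here is a minimal sketch on a tiny DataFrame. The frame and the helper name `range_above_min` are made up for illustration. I pass function names as strings (`'mean'`) rather than NumPy callables (`np.mean`), since newer Pandas releases may warn about the latter.

```python
import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [10, 20, 30, 40]})

# One function for every column -> a Series indexed by column name.
all_means = df.agg("mean")

# One function per column -> a Series with one value per column.
per_column = df.agg({"col1": "mean", "col2": "sum"})

# Several functions per column -> a DataFrame indexed by function name.
# Combinations that were not requested (e.g. col1/sum) are filled with NaN.
multi = df.agg({"col1": ["mean", "min"], "col2": ["sum", "max"]})

# A custom reducer works as long as it returns a single value.
def range_above_min(x):
    return x.mean() - x.min()

custom = df["col1"].agg(range_above_min)

print(all_means, per_column, multi, custom, sep="\n\n")
```

Note that the dictionary-of-lists form does not rename anything: the function names become the row labels of the result.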

These are just a few examples of how the `agg()` method can be used. You can also use the `groupby()` function in conjunction with the `agg()` method to apply functions to data groups within the DataFrame.


### Using the agg method with groupby

The `groupby()` function in Pandas allows you to group a DataFrame or Series by one or more columns and apply a function to each group. You can use the `agg()` method in combination with the `groupby()` function to apply multiple functions to the groups.

To illustrate this, let's consider the following synthetic dataset:

```
import numpy as np
import pandas as pd
# Create a synthetic dataset with 10 million rows and 3 columns
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 100, size=(10000000, 3)), columns=['col1', 'col2', 'Category'])
```

Suppose we want to group the data by the `Category` column and compute the mean and sum of each column for each group. We can do this using the `groupby()` and `agg()` functions as follows:

`df.groupby('Category').agg({'col1': ['mean', 'sum'], 'col2': ['mean', 'sum']})`

This would return a new DataFrame with one row for each unique category in the `Category` column, two columns for the mean and sum of `col1`, and two columns for the mean and sum of `col2`.
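The four result columns come back as a two-level column index, which can be awkward to work with downstream. The sketch below contrasts that with named aggregation, which produces flat column names. It uses a much smaller stand-in dataset, and the output names (`col1_mean` and so on) are my own choices for illustration.

```python
import numpy as np
import pandas as pd

# A much smaller stand-in for the 10-million-row dataset.
rng = np.random.default_rng(0)
small = pd.DataFrame(
    rng.integers(0, 5, size=(1_000, 3)),
    columns=["col1", "col2", "Category"],
)

# Dict-of-lists form: the result has two-level (MultiIndex) columns
# such as ('col1', 'mean') and ('col1', 'sum').
nested = small.groupby("Category").agg(
    {"col1": ["mean", "sum"], "col2": ["mean", "sum"]}
)

# Named aggregation: keyword=(source column, function) gives flat,
# explicitly named output columns.
flat = small.groupby("Category").agg(
    col1_mean=("col1", "mean"),
    col1_sum=("col1", "sum"),
    col2_mean=("col2", "mean"),
    col2_sum=("col2", "sum"),
)

print(nested.head())
print(flat.head())
```

Both results hold the same numbers; only the column labels differ.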

You can also use the `groupby()` function with the `apply()` method to apply a custom function to each group. For example:

```
def custom_function(group):
    # `group` is the sub-DataFrame for one category
    return group.mean() - group.min()

df.groupby('Category').apply(custom_function)
```
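For comparison, the same kind of custom reducer can also be passed to `agg()` after `groupby()`. This is a minimal sketch on a toy frame with made-up values; `mean_minus_min` is a hypothetical helper name. Here the function receives one column of one group at a time and must return a single value.

```python
import pandas as pd

# Toy data with two categories; the values are made up for this sketch.
toy = pd.DataFrame({
    "Category": ["a", "a", "b", "b"],
    "col1": [1, 3, 10, 20],
    "col2": [2, 6, 5, 15],
})

def mean_minus_min(values):
    # `values` is one column of one group (a Series); return a scalar.
    return values.mean() - values.min()

# Select the value columns explicitly so the grouping column is not
# passed to the function.
result = toy.groupby("Category")[["col1", "col2"]].agg(mean_minus_min)
print(result)
```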

## Benefits of using agg() over apply()

The `agg()` method and the `apply()` method in Pandas both apply functions to a DataFrame or Series. However, there are a few key differences between the two approaches:

- The `agg()` method is specifically designed for applying multiple functions to a DataFrame or Series at once, whereas the `apply()` method is more flexible and can apply a single function or a user-defined function to a DataFrame or Series.
- The `agg()` method is generally faster than the `apply()` method, because built-in aggregations requested by name (such as `'mean'` or `'sum'`) can run in optimized, vectorized code instead of going through a generic Python function call. This can be especially important when working with large datasets.
- The `agg()` method has a more concise syntax, as you can specify multiple functions in a single line of code. This can make your code easier to read and maintain.

Overall, the `agg()` method is generally a better choice if you want to apply multiple functions to a DataFrame or Series and performance is a concern. The `apply()` method is more flexible and is the better fit when you need custom logic that the built-in aggregation functions of `agg()` cannot express.


## When to use apply() over .agg() for aggregation?

As mentioned earlier, the `agg()` method is specifically designed for applying multiple functions to a DataFrame or Series at once, whereas the `apply()` method is more flexible and can be used to apply a single function or a user-defined function to a DataFrame or Series.

If you only want to apply a single function to a DataFrame or Series, and that function is not one of the built-in aggregation functions provided by `agg()`, then you can use the `apply()` method.

For example, suppose you have a DataFrame with a column of strings and want to apply a custom function that counts the number of vowels in each string. You could do this using the `apply()` method:

```
def count_vowels(x):
    # Count the characters of the string x that are vowels
    vowels = 'aeiouAEIOU'
    return sum(c in vowels for c in x)

df['string_column'].apply(count_vowels)
```

In this case, the `agg()` method is not the right tool. Counting vowels is an element-wise operation that returns one value per row rather than reducing the column to a summary value, and `agg()` has no built-in function for it.
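Here is a runnable version of the vowel example, as a small sketch. The sample strings are made up, and the final `agg()` call is my addition to show how the two methods can work together: `apply()` produces one value per row, and `agg()` then summarizes those values.

```python
import pandas as pd

# Made-up sample strings standing in for a real text column.
words = pd.Series(["pandas", "aggregate", "Apply", "xyz"], name="string_column")

def count_vowels(text):
    vowels = "aeiouAEIOU"
    return sum(ch in vowels for ch in text)

# Element-wise step: one vowel count per string.
per_row = words.apply(count_vowels)

# Reduction step: summarize the per-row counts.
summary = per_row.agg(["sum", "max"])

print(per_row.tolist())
print(summary)
```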

On the other hand, if you want to apply multiple functions to a DataFrame or Series, then the `agg()` method is generally a better choice because it is more concise and efficient.

For example, suppose you want to compute each column's mean, sum, and standard deviation in a DataFrame. You could do this using the `agg()` method:

`df.agg({'col1': ['mean', 'sum', 'std'], 'col2': ['mean', 'sum', 'std']})`

In this case, the `apply()` method would be more cumbersome, as you would need to define a custom function that computes all three statistics and apply it to each column.
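To make the comparison concrete, here is a sketch of both routes on a tiny frame with made-up values. The helper name `summarize` is hypothetical, and I use the list form of `agg()` because the same three functions are wanted for every column.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up.
df_small = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [10, 20, 30, 50]})

# agg: name the three statistics once.
with_agg = df_small.agg(["mean", "sum", "std"])

# apply: a hand-written helper has to build the same summary itself.
def summarize(column):
    return pd.Series(
        {"mean": column.mean(), "sum": column.sum(), "std": column.std()}
    )

with_apply = df_small.apply(summarize)

print(with_agg)
print(with_apply)
```

Both calls produce a table with one row per statistic and one column per input column.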

## Built-in functions provided by .agg()

The `agg()` method in Pandas accepts the names of a number of built-in aggregation functions as strings. These include:

- `'sum'`: computes the sum of the values in a column.
- `'mean'`: computes the mean of the values in a column.
- `'count'`: counts the number of non-NA/null values in a column.
- `'min'`: computes the minimum value in a column.
- `'max'`: computes the maximum value in a column.
- `'median'`: computes the median of the values in a column.
- `'std'`: computes the standard deviation of the values in a column.
- `'var'`: computes the variance of the values in a column.
- `'sem'`: computes the standard error of the mean of the values in a column.
- `'first'`: returns the first non-NA/null value in each group (used after `groupby()`).
- `'last'`: returns the last non-NA/null value in each group (used after `groupby()`).

Here is an example of how you can use these built-in functions with the `agg()` method:

`df.agg({'col1': ['sum', 'mean'], 'col2': ['min', 'max']})`

This would compute the sum and mean of `col1` and the minimum and maximum of `col2`.
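The note about non-NA/null values matters in practice. The sketch below uses made-up data containing missing values to show how several of these names behave: the missing entries are skipped rather than counted or propagated. The column names `team` and `score` are invented for this example.

```python
import numpy as np
import pandas as pd

# A column with one missing value.
scores = pd.Series([4.0, np.nan, 6.0, 8.0], name="score")
stats = scores.agg(["count", "sum", "mean", "min", "max", "median"])

# 'first' and 'last' applied per group: they skip missing values too.
games = pd.DataFrame({
    "team": ["x", "x", "y", "y"],
    "score": [np.nan, 4.0, 6.0, np.nan],
})
per_team = games.groupby("team")["score"].agg(["first", "last", "count"])

print(stats)
print(per_team)
```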

## Performance difference between .apply() and .agg() on a synthetic dataset

The `apply()` method and the `agg()` method in Pandas are useful for applying functions to a DataFrame or Series. However, the `agg()` method is generally faster than the `apply()` method because it uses a more efficient implementation under the hood.


To illustrate the performance difference between the two methods, let's consider the following synthetic dataset:

```
import numpy as np
import pandas as pd
# Create a synthetic dataset with 10 million rows and 2 columns
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 100, size=(10000000, 2)), columns=['col1', 'col2'])
```

Suppose we want to apply the `mean()` function to both columns of this dataset. We can do this using the `apply()` method. We'll also use the `%time` magic command, available in IPython and Jupyter, to measure the performance.

`%time df.apply(np.mean)`

On my machine, this takes about 2.28 seconds to run.

Now let's try the same thing using the `agg()` method:

`%time df.agg(np.mean)`

This takes about 0.68 seconds to run on my machine, which is more than three times faster than the `apply()` method.

This performance difference can be even more pronounced for more complex custom functions or larger datasets. Therefore, if performance is a concern and you want to apply multiple functions to a DataFrame or Series, it is generally a good idea to use the `agg()` method rather than the `apply()` method. Timings vary with hardware and Pandas version, so measure on your own data before changing a pipeline.
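If you want to repeat the measurement outside a notebook, where `%time` is not available, something like the following sketch may help. It uses the standard-library `timeit` module and a smaller dataset of one million rows so that it finishes quickly. The labels and the best-of-three approach are my own choices, and the numbers it prints will differ from the ones quoted above.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller stand-in dataset (1 million rows) so the benchmark runs quickly.
rng = np.random.default_rng(0)
bench = pd.DataFrame(
    rng.integers(0, 100, size=(1_000_000, 2)),
    columns=["col1", "col2"],
)

# The two calls being compared. Newer Pandas versions may warn about
# NumPy callables such as np.mean; the string name avoids that.
candidates = {
    "apply(np.mean)": lambda: bench.apply(np.mean),
    "agg('mean')": lambda: bench.agg("mean"),
}

timings = {}
for label, func in candidates.items():
    # Best of three single runs, in seconds.
    timings[label] = min(timeit.repeat(func, number=1, repeat=3))
    print(f"{label}: {timings[label]:.4f} s")
```

Taking the minimum of several runs reduces the influence of background activity on the machine.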