{"id":372,"date":"2022-12-26T00:00:00","date_gmt":"2022-12-26T00:00:00","guid":{"rendered":"https:\/\/tac.debuzzify.com\/?p=372"},"modified":"2023-06-20T09:35:39","modified_gmt":"2023-06-20T09:35:39","slug":"agg-method-vs-apply-pandas","status":"publish","type":"post","link":"https:\/\/www.the-analytics.club\/agg-method-vs-apply-pandas\/","title":{"rendered":"Use agg() Method Over apply() To Accumulate Pandas Dataframes Faster."},"content":{"rendered":"\n

Data scientists extensively use Pandas for data wrangling. Aggregation is a common task in data wrangling. But in Pandas, there is more than one way to do this.<\/p>\n\n\n\n

One way is to use the apply<\/code><\/a> method (or map<\/code><\/a> if it’s a series) on a dataframe. This is what I usually do. But the .<\/code>agg<\/code><\/a> method is another option I often neglect.<\/p>\n\n\n\n

For a one-time analysis, this isn’t a big deal. The difference is minuscule for smaller datasets.<\/p>\n\n\n\n

Yet, when we work on a larger dataset, we must know the impact, especially if the operation is supposed to run repeatedly, as in a data pipeline.<\/p>\n\n\n\n\n\n

This post discusses the difference and analyses its impact on an ordinary computer. I hope this will help you find the best option for your recurring data tasks.<\/p>\n\n\n\n

\n

Talking about Pandas, here are the top resources to boost your data-wrangling skills.<\/p>\n\n\n\n

[Book]<\/b> Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter<\/b> 3rd Edition<\/b><\/a> by<\/b> Wes McKinney<\/b><\/a>
[Online Course]<\/b>
Data Analysis with Pandas and Python by Boris Paskhaver<\/b><\/a><\/b> <\/b><\/p>\n\n\n\n

<\/b>The above links are carefully curated affiliate links. I earn a commission for qualified purchases at no extra cost to you. But it never affects what we pick.<\/p>\n<\/blockquote>\n\n\n\n

Related: <\/b>A Better Way to Summarize Pandas Dataframes.<\/i><\/b><\/a><\/p>\n\n\n\n

How to use the agg method to accumulate dataframe columns<\/h2>\n\n\n\n

The Pandas agg() method lets you apply one or more aggregation functions to a DataFrame or Series in a single call. It can be used in various ways, depending on the desired output.<\/p>\n\n\n\n

Some common ways to use the agg()<\/code> method include:<\/p>\n\n\n\n

Applying a single function to all columns:<\/p>\n\n\n\n

df.agg(np.mean)<\/code><\/pre><\/div>\n\n\n\n

Applying a single function to a specific column:<\/p>\n\n\n\n

df['column_name'].agg(np.mean)<\/code><\/pre><\/div>\n\n\n\n

Applying different functions to different columns:<\/p>\n\n\n\n

df.agg({'col1': np.mean, 'col2': np.sum, 'col3': np.std})<\/code><\/pre><\/div>\n\n\n\n

Applying several functions to each of several columns:<\/p>\n\n\n\n

df.agg({'col1': ['mean', 'min'], 'col2': ['sum', 'max']})<\/code><\/pre><\/div>\n\n\n\n

Applying a custom function to a column:<\/p>\n\n\n\n

def custom_function(x):\n    return x.mean() - x.min()\n\ndf['col1'].agg(custom_function)<\/code><\/pre><\/div>\n\n\n\n

These are just a few examples of how the agg()<\/code> method can be used. You can also use the groupby()<\/code> function in conjunction with the agg()<\/code> method to apply functions to groups of rows within the DataFrame.<\/p>\n\n\n\n

Related: <\/b>How to Run SQL Queries on Pandas Data Frames?<\/i><\/b><\/a><\/p>\n\n\n\n

Using the agg method with group by<\/h3>\n\n\n\n

The groupby()<\/code> function in Pandas allows you to group a DataFrame or Series by one or more columns and apply a function to each group. You can use the agg()<\/code> method in combination with the groupby()<\/code> function to apply multiple functions to the groups.<\/p>\n\n\n\n

To illustrate this, let’s consider the following synthetic dataset:<\/p>\n\n\n\n

import numpy as np\nimport pandas as pd\n\n# Create a synthetic dataset with 10 million rows and 3 columns\nnp.random.seed(0)\ndf = pd.DataFrame(np.random.randint(0, 100, size=(10000000, 3)), columns=['col1', 'col2', 'Category'])<\/code><\/pre><\/div>\n\n\n\n

Suppose we want to group the data by the Category<\/code> column and compute the mean and sum of each column for each group. We can do this using the groupby()<\/code> and agg()<\/code> functions as follows:<\/p>\n\n\n\n

df.groupby('Category').agg({'col1': ['mean', 'sum'], 'col2': ['mean', 'sum']})<\/code><\/pre><\/div>\n\n\n\n

This returns a new DataFrame with one row for each unique value in the Category<\/code> column, two columns for the mean and sum of col1<\/code>, and two columns for the mean and sum of col2<\/code>.<\/p>\n\n\n\n
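If you would rather get flat, readable column names than a two-level column header, named aggregation is one option. The following is a minimal sketch on a tiny, made-up frame (the name small and its values are assumptions for illustration, not the 10-million-row dataset above):

```python
import pandas as pd

# Small illustrative frame (hypothetical data)
small = pd.DataFrame({
    "Category": ["a", "a", "b", "b"],
    "col1": [1, 3, 5, 7],
    "col2": [10, 20, 30, 40],
})

# Named aggregation: each keyword becomes an output column,
# and each value is an (input column, function name) pair.
summary = small.groupby("Category").agg(
    col1_mean=("col1", "mean"),
    col1_sum=("col1", "sum"),
    col2_mean=("col2", "mean"),
    col2_sum=("col2", "sum"),
)
print(summary)
```

The result has one row per category and four plainly named columns, which tends to be easier to pass on to later pipeline steps.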

You can also use the groupby()<\/code> function with the apply()<\/code> method to apply a custom function to each group. For example:<\/p>\n\n\n\n

def custom_function(group):\n    return group.mean() - group.min()\n\ndf.groupby('Category').apply(custom_function)<\/code><\/pre><\/div>\n\n\n\n

Benefits of using agg() over apply()<\/h2>\n\n\n\n

Both the agg()<\/code> method and the apply()<\/code> method in Pandas apply functions to a DataFrame or Series. However, there are a few key differences between the two approaches:<\/p>\n\n\n\n

    \n
  1. The agg()<\/code> method is specifically designed for applying one or more aggregation functions to a DataFrame or Series at once, whereas the apply()<\/code> method is more flexible and can apply any single function, built-in or user-defined, to a DataFrame or Series.<\/li>\n\n\n\n
  2. The agg()<\/code> method is generally faster than the apply()<\/code> method, particularly for built-in aggregations, because Pandas can route them to its optimized implementations instead of calling a Python function on every column or group. This can be especially important when working with large datasets.<\/li>\n\n\n\n
  3. The agg()<\/code> method has a more concise syntax, as you can specify multiple functions in a single line of code. This can make your code easier to read and maintain.<\/li>\n<\/ol>\n\n\n\n

    Overall, the agg()<\/code> method is generally a better choice if you want to apply multiple functions to a DataFrame or Series and performance is a concern. The apply()<\/code> method is more flexible and is the better fit when your custom logic cannot be expressed with the built-in aggregation functions that agg()<\/code> accepts.<\/p>\n\n\n\n

    Related: <\/b>Pandas Replace: The Faster and Better Approach to Change Values of a Column.<\/i><\/b><\/a><\/p>\n\n\n\n

    When to use apply() over .agg() for accumulation?<\/h2>\n\n\n\n

    As mentioned earlier, the agg()<\/code> method is specifically designed for applying multiple functions to a DataFrame or Series at once, whereas the apply()<\/code> method is more flexible and can be used to apply a single function or a user-defined function to a DataFrame or Series.<\/p>\n\n\n\n

    If you only want to apply a single function to a DataFrame or Series, and that function transforms each value rather than reducing the data to a summary, then you can use the apply()<\/code> method.<\/p>\n\n\n\n

    For example, suppose you have a DataFrame with a column of strings and want to apply a custom function that counts the number of vowels in each string. You could do this using the apply()<\/code> method:<\/p>\n\n\n\n

def count_vowels(x):\n    vowels = 'aeiouAEIOU'\n    return sum(c in vowels for c in x)\n\ndf['string_column'].apply(count_vowels)<\/code><\/pre><\/div>\n\n\n\n

    In this case, agg()<\/code> is not the right tool: counting vowels produces one result per string, which is an element-wise transformation, not an aggregation that reduces the column to a summary value.<\/p>\n\n\n\n
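The distinction that matters here is element-wise versus reducing functions. As a rough sketch (the frame, its string_column values, and the helper total_vowels are hypothetical), apply() on the column yields one count per row, while a custom reducer passed to DataFrame.agg() yields a single total per column:

```python
import pandas as pd

def count_vowels(x):
    vowels = "aeiouAEIOU"
    return sum(c in vowels for c in x)

# Hypothetical frame with a single column of strings
words = pd.DataFrame({"string_column": ["apple", "sky", "Idea"]})

# Element-wise: one vowel count per row
per_row = words["string_column"].apply(count_vowels)

# Reduction: a custom function that receives a whole column
# and returns one scalar for it
def total_vowels(column):
    return sum(count_vowels(s) for s in column)

totals = words.agg(total_vowels)
print(per_row.tolist(), totals["string_column"])
```

The reducer is passed to the DataFrame rather than to the Series on purpose: with a Series, some Pandas versions first try a callable on each element, which would blur the difference this example is meant to show.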

    On the other hand, if you want to apply multiple functions to a DataFrame or Series, then the agg()<\/code> method is generally a better choice because it is more concise and efficient.<\/p>\n\n\n\n

    For example, suppose you want to compute each column’s mean, sum, and standard deviation in a DataFrame. You could do this using the agg()<\/code> method:<\/p>\n\n\n\n

df.agg({'col1': ['mean', 'sum', 'std'], 'col2': ['mean', 'sum', 'std']})<\/code><\/pre><\/div>\n\n\n\n

    In this case, the apply() method would be more cumbersome, as you would need to define a custom function that computes all three statistics and apply it to each column separately.<\/p>\n\n\n\n

    Built-in functions provided by .agg()<\/h2>\n\n\n\n

    The agg()<\/code> method in Pandas accepts the names of many built-in aggregation functions as strings. These include:<\/p>\n\n\n\n

      \n
    1. 'sum'<\/code>: computes the sum of the values in a column.<\/li>\n\n\n\n
    2. 'mean'<\/code>: computes the mean of the values in a column.<\/li>\n\n\n\n
    3. 'count'<\/code>: counts the number of non-NA\/null values in a column.<\/li>\n\n\n\n
    4. 'min'<\/code>: computes the minimum value in a column.<\/li>\n\n\n\n
    5. 'max'<\/code>: computes the maximum value in a column.<\/li>\n\n\n\n
    6. 'median'<\/code>: computes the median of the values in a column.<\/li>\n\n\n\n
    7. 'std'<\/code>: computes the standard deviation of the values in a column.<\/li>\n\n\n\n
    8. 'var'<\/code>: computes the variance of the values in a column.<\/li>\n\n\n\n
    9. 'sem'<\/code>: computes the standard error of the mean of the values in a column.<\/li>\n\n\n\n
    10. 'first'<\/code>: returns the first non-NA\/null value in a column (typically used together with groupby()).<\/li>\n\n\n\n
    11. 'last'<\/code>: returns the last non-NA\/null value in a column (typically used together with groupby()).<\/li>\n<\/ol>\n\n\n\n

      Here is an example of how you can use these built-in functions with the agg()<\/code> method:<\/p>\n\n\n\n

df.agg({'col1': ['sum', 'mean'], 'col2': ['min', 'max']})<\/code><\/pre><\/div>\n\n\n\n

      This would compute the sum and mean of col1<\/code> and the minimum and maximum of col2<\/code>.<\/p>\n\n\n\n
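One detail worth knowing when columns get different function lists: the result is indexed by the union of all requested functions, and any function that was not requested for a column shows up as NaN. A small sketch on made-up data (the frame small is an assumption for illustration):

```python
import pandas as pd

# Tiny hypothetical frame so the output is easy to inspect
small = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 20, 30]})

# Different function lists per column
result = small.agg({"col1": ["sum", "mean"], "col2": ["min", "max"]})
print(result)

# 'sum' was only requested for col1, so the col2 cell is missing
print(pd.isna(result.loc["sum", "col2"]))
```

If the NaN cells get in the way, requesting the same list of functions for every column avoids them.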

      Performance improvement between .apply() and .agg() using synthetic dataset.<\/h2>\n\n\n\n

      The apply()<\/code> method and the agg()<\/code> method in Pandas are useful for applying functions to a DataFrame or Series. However, the agg()<\/code> method is generally faster than the apply()<\/code> method because it uses a more efficient implementation under the hood.<\/p>\n\n\n\n

      Related: <\/b>5 Pandas Performance Optimization Tips Without Crazy Setups.<\/i><\/b><\/a><\/p>\n\n\n\n

      To illustrate the performance difference between the two methods, let’s consider the following synthetic dataset:<\/p>\n\n\n\n

import numpy as np\nimport pandas as pd\n\n# Create a synthetic dataset with 10 million rows and 2 columns\nnp.random.seed(0)\ndf = pd.DataFrame(np.random.randint(0, 100, size=(10000000, 2)), columns=['col1', 'col2'])<\/code><\/pre><\/div>\n\n\n\n
%time df.apply(np.mean)<\/code><\/pre><\/div>\n\n\n\n

      On my machine, this takes about 2.28 seconds to run.<\/p>\n\n\n\n

      Now let’s try the same thing using the agg()<\/code> method:<\/p>\n\n\n\n

%time df.agg(np.mean)<\/code><\/pre><\/div>\n\n\n\n

      This takes about 0.68 seconds to run on my machine, which is significantly faster than the apply()<\/code> method.<\/p>\n\n\n\n
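The %time magic only works inside IPython or Jupyter. If you want to repeat the comparison in a plain Python script, a sketch along these lines may help. It is not the measurement used above: the helper time_call, the smaller 1-million-row frame, and the string form agg("mean") are all assumptions made for illustration, and the timings will differ by machine and Pandas version.

```python
import time

import numpy as np
import pandas as pd

def time_call(func, repeats=3):
    """Return the best wall-clock time in seconds over a few repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best

# Smaller synthetic dataset so the script finishes quickly
rng = np.random.default_rng(0)
bench = pd.DataFrame(
    rng.integers(0, 100, size=(1_000_000, 2)), columns=["col1", "col2"]
)

apply_seconds = time_call(lambda: bench.apply(np.mean))
agg_seconds = time_call(lambda: bench.agg("mean"))
print(f"apply: {apply_seconds:.4f}s, agg: {agg_seconds:.4f}s")
```

Taking the best of a few repeats reduces noise from other processes, which matters when the gap between the two calls is well under a second.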

      This performance difference can be even more pronounced for more complex custom functions or larger datasets. Therefore, if performance is a concern and you want to apply multiple functions to a DataFrame or Series, it is generally a good idea to use the agg()<\/code> method rather than the apply()<\/code> method.<\/p>\n","protected":false},"excerpt":{"rendered":"

      Speed up aggregations using .agg method instead of .apply and .map methods.<\/p>\n","protected":false},"author":2,"featured_media":262,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_kad_blocks_custom_css":"","_kad_blocks_head_custom_js":"","_kad_blocks_body_custom_js":"","_kad_blocks_footer_custom_js":"","_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"footnotes":""},"categories":[7,4],"tags":[],"taxonomy_info":{"category":[{"value":7,"label":"Data Wrangling"},{"value":4,"label":"Data Science"}]},"featured_image_src_large":["https:\/\/www.the-analytics.club\/wp-content\/uploads\/2023\/06\/tuplex-1024x545.jpg",1024,545,true],"author_info":{"display_name":"Thuwarakesh","author_link":"https:\/\/www.the-analytics.club\/author\/thuwarakesh\/"},"comment_info":0,"category_info":[{"term_id":7,"name":"Data Wrangling","slug":"data-wrangling","term_group":0,"term_taxonomy_id":7,"taxonomy":"category","description":"","parent":4,"count":4,"filter":"raw","cat_ID":7,"category_count":4,"category_description":"","cat_name":"Data Wrangling","category_nicename":"data-wrangling","category_parent":4},{"term_id":4,"name":"Data Science","slug":"data-science","term_group":0,"term_taxonomy_id":4,"taxonomy":"category","description":"","parent":0,"count":22,"filter":"raw","cat_ID":4,"category_count":22,"category_description":"","cat_name":"Data 
Science","category_nicename":"data-science","category_parent":0}],"tag_info":false,"_links":{"self":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts\/372"}],"collection":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/comments?post=372"}],"version-history":[{"count":4,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts\/372\/revisions"}],"predecessor-version":[{"id":676,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/posts\/372\/revisions\/676"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/media\/262"}],"wp:attachment":[{"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/media?parent=372"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/categories?post=372"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.the-analytics.club\/wp-json\/wp\/v2\/tags?post=372"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}

      Suppose we want to apply the mean()<\/code> function to both columns of this dataset. We can do this using the apply()<\/code> method. We’ll also use the time magic function<\/a> to measure the performance.<\/p>\n\n\n\n