Pandas Replace: The Faster and Better Approach to Change Values of a Column.

Feb. 20, 2022

Pandas Replace: The Faster and Better Approach to Change Values of a Column.

Replacing values on a dataframe can sometimes be very tricky. Bulk replacement in a large dataset could be difficult and slow.

Yet, Pandas is flexible enough to do it better.

You may have to postpone the idea of moving to a distributed computing system. At least, not to solve the value replacement problems.

Replacing column values: The old approach.

The popular approach to replacing a value is locating the cell (or the column) and assigning it a new value.

Here's how people do it (and how I did it when starting data analysis.)

Title Category Followers Engagement
Cristiano Ronaldo Sports with a ball 400100000.0 7800000.0
Kylie 🤍 Fashion|Modeling|Beauty N/A 6200000.0
Leo Messi Sports with a ball|Family 306300000.0 N/A
Kendall Modeling|Fashion N/A 3400000.0
Selena Gomez Music|Lifestyle 295800000.0 2700000.0

The above is a dataset of top Instagram influencers. You can download the original dataset from Kaggle. Let's suppose we've noticed that the dataset has "N/A" in place of some missing values.

People will do this.

df.loc[df.Followers == 'N/A', 'Followers'] = np.nan
df.loc[df.Engagement == 'N/A', 'Engagement'] = np.nan

As we can see, Pandas will have to create an index that satisfies our conditions and then replace them. This isn't very optimal.

It may seem this method gives us more control over the dataset. But in large datasets, or when we have complex criteria for each column, this could be painstaking.

Related: A Better Way to Summarize Pandas Dataframes.

Using the replace API on a Pandas dataframe

Pandas library has an incredible API called replace. This article covers the beauty of this API in different use cases.

Here's how we could rework the above example.

df.replace(to_replace='N/A', value=np.nan)

# Or more implicitly
df.replace('N/A', np.nan)

Pandas scan through the dataframe only once in the new method and replace the values as it checks.

Both results in the following table.

Title Category Followers Engagement
Cristiano Ronaldo Sports with a ball 400100000.0 7800000.0
Kylie 🤍 Fashion|Modeling|Beauty NaN 6200000.0
Leo Messi Sports with a ball|Family 306300000.0 NaN
Kendall Modeling|Fashion NaN 3400000.0
Selena Gomez Music|Lifestyle 295800000.0 2700000.0

Not to mention, the Replace method is more elegant. A different person can find this syntax more readable than the previous one.

The replace API has more advanced ways to change column values. Let's have a look at them one-by-one.

Regex-based value replacement on a Pandas dataframe.

Regular expressions are a faster way to replace values with a dynamic selection. It would be challenging to replace these values on a massive dataset if it were not using Regex.

Suppose we need to replace the string "Sports with a \<anything>" with "Ball sports." This isn't an exact match we can do the usual way. A regex replacement is ideal for this.

The following replace API call would do the trick as we've set the regex expression to match our criteria. Notice that we've set regex=True when we're calling it.

df.replace(r'Sports with a \w+', 'Ball sports', regex=True)

The resulting table would look like this:

Title Category Followers Engagement
Cristiano Ronaldo Ball sports 400100000.0 7800000.0
Kylie 🤍 Fashion|Modeling|Beauty N/A 6200000.0
Leo Messi Ball sports|Family 306300000.0 N/A
Kendall Modeling|Fashion N/A 3400000.0
Selena Gomez Music|Lifestyle 295800000.0 2700000.0

Using dictionaries to replace values on a Pandas dataframe.

Dictionaries give us a very straightforward way to define value mappings to replace. We can directly pass a dictionary with current values as keys and replacements as values.

The below code will replace 'Music' with 'Performing Art' and 'N/A's with np.nan. Replacement happens in all columns as we didn't specify a column name in particular.

df.replace({'N/A': np.nan, 'Music': 'Performing Art'})
Title Category Followers Engagement
Cristiano Ronaldo Sports with a ball 400100000.0 7800000.0
Kylie 🤍 Fashion|Modeling|Beauty NaN 6200000.0
Leo Messi Sports with a ball|Family 306300000.0 NaN
Kendall Modeling|Fashion NaN 3400000.0
Selena Gomez Music|Lifestyle 295800000.0 2700000.0

The power of dictionaries in replacing values is more than simply giving a mapping. We can have nested dictionaries to specify a column first and then mapping. This way we can specify different replacement mappings for every column.

df.replace({
    'Engagement':{'N/A': np.nan}, 
    'Category':{'Sports with a ball': 'Sports'},
})
Title Category Followers Engagement
Cristiano Ronaldo Sports 400100000.0 7800000.0
Kylie 🤍 Fashion|Modeling|Beauty N/A 6200000.0
Leo Messi Sports with a ball|Family 306300000.0 NaN
Kendall Modeling|Fashion N/A 3400000.0
Selena Gomez Music|Lifestyle 295800000.0 2700000.0

Lastly we can combine the merits of nested dictionaries with regex. We can specify different rules to each column and replace them with new value.

The below code will do a regex-based string replace on column 'category'. Yet, on the 'followers' column it assigns np.nan to the missing values.

df.replace({
    'Category': {r'Sports with a \w+': 'Ball sports'},
    'Followers': {'N/A': np.nan}
}, regex=True)
Title Category Followers Engagement
Cristiano Ronaldo Ball sports 400100000.0 7800000.0
Kylie 🤍 Fashion|Modeling|Beauty NaN 6200000.0
Leo Messi Ball sports|Family 306300000.0 N/A
Kendall Modeling|Fashion NaN 3400000.0
Selena Gomez Music|Lifestyle 295800000.0 2700000.0

This method is very much flexible and faster. The regex search happens only within one column instead of the entire dataset. This is particularly useful if the dataset contains a ton of records.

Related: How to Speed up Python Data Pipelines up to 91X?


Thanks for the read, friend. It seems you and I have lots of common interests. Say Hi to me on LinkedIn, Twitter, and Medium. I’ll break the ice for you.

Not a Medium member yet? Please use this link to become a member because I earn a commission for referring at no extra cost for you.

How we work

Readers support The Analytics Club. We earn through display ads. Also, when you buy something we recommend, we may get an affiliate commission. But it never affects your price or what we pick.

Connect with us