Pandas Replace: The Faster and Better Approach to Changing Column Values.
Replacing values in a dataframe can be surprisingly tricky, and bulk replacement across a large dataset can be slow and error-prone.
Yet Pandas is flexible enough to handle it well.
You can postpone the idea of moving to a distributed computing system; value replacement alone isn't a reason to make the switch.
Related: Is Your Python For-loop Slow? Use NumPy Instead
Replacing column values: The old approach.
The popular approach to replacing a value is locating the cell (or the column) and assigning it a new value.
Here’s how people do it (and how I did it when I was starting out in data analysis).
| Title | Category | Followers | Engagement |
|---|---|---|---|
| Cristiano Ronaldo | Sports with a ball | 400100000.0 | 7800000.0 |
| Kylie ???? | Fashion\|Modeling\|Beauty | N/A | 6200000.0 |
| Leo Messi | Sports with a ball\|Family | 306300000.0 | N/A |
| Kendall | Modeling\|Fashion | N/A | 3400000.0 |
| Selena Gomez | Music\|Lifestyle | 295800000.0 | 2700000.0 |
The above is a dataset of top Instagram influencers. You can download the original dataset from Kaggle. Let’s suppose we’ve noticed that the dataset has “N/A” in place of some missing values.
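If you want to follow along, here is a minimal setup sketch. The file name below is a placeholder, and depending on the exact Kaggle file you may need to rename or trim columns to match the ones shown above.
import numpy as np
import pandas as pd

# Placeholder file name; point this at the CSV you downloaded from Kaggle
df = pd.read_csv('top_influencers.csv')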
People will do this.
df.loc[df.Followers == 'N/A', 'Followers'] = np.nan
df.loc[df.Engagement == 'N/A', 'Engagement'] = np.nan
As we can see, Pandas has to build a boolean mask for each condition and then assign the new values, one column at a time. This isn’t optimal.
This method may seem to give us more control over the dataset. But on large datasets, or when each column has its own replacement criteria, it quickly becomes painstaking.
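To see how quickly it gets repetitive, here is a sketch of the same old approach generalized to several columns; the column list is just an assumption about which columns in your copy contain the “N/A” placeholder.
# One mask and one assignment per column; every new rule means another line or another loop
for col in ['Followers', 'Engagement']:
    df.loc[df[col] == 'N/A', col] = np.nan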
Related: A Better Way to Summarize Pandas Dataframes.
Using the replace API on a Pandas dataframe
The Pandas library has an incredible API called replace. This article covers the beauty of this API across different use cases.
Here’s how we could rework the above example.
df.replace(to_replace='N/A', value=np.nan)
# Or, more concisely, with positional arguments
df.replace('N/A', np.nan)
With this method, Pandas scans through the dataframe only once and replaces the values as it goes.
Both calls produce the same result, shown in the following table.
| Title | Category | Followers | Engagement |
|---|---|---|---|
| Cristiano Ronaldo | Sports with a ball | 400100000.0 | 7800000.0 |
| Kylie ???? | Fashion\|Modeling\|Beauty | NaN | 6200000.0 |
| Leo Messi | Sports with a ball\|Family | 306300000.0 | NaN |
| Kendall | Modeling\|Fashion | NaN | 3400000.0 |
| Selena Gomez | Music\|Lifestyle | 295800000.0 | 2700000.0 |
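One thing to keep in mind: replace returns a new dataframe and leaves the original untouched, so assign the result back (or pass inplace=True) if you want to keep the change. A minimal sketch:
# replace does not modify df in place by default
df = df.replace('N/A', np.nan)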
Not to mention, the replace method is more elegant. Someone else reading your code will likely find this syntax more readable than the previous one.
The replace API has more advanced ways to change column values. Let’s have a look at them one by one.
Regex-based value replacement on a Pandas dataframe.
Regular expressions let us replace values that match a dynamic pattern rather than an exact string. Replacing such values in a massive dataset would be challenging without regex.
Suppose we need to replace the string “Sports with a \<anything>” with “Ball sports.” This isn’t an exact match we can handle the usual way, so a regex replacement is ideal here.
The following replace call does the trick, since we’ve written a regex pattern that matches our criteria. Notice that we pass regex=True in the call.
df.replace(r'Sports with a \w+', 'Ball sports', regex=True)
The resulting table would look like this:
| Title | Category | Followers | Engagement |
|---|---|---|---|
| Cristiano Ronaldo | Ball sports | 400100000.0 | 7800000.0 |
| Kylie ???? | Fashion\|Modeling\|Beauty | N/A | 6200000.0 |
| Leo Messi | Ball sports\|Family | 306300000.0 | N/A |
| Kendall | Modeling\|Fashion | N/A | 3400000.0 |
| Selena Gomez | Music\|Lifestyle | 295800000.0 | 2700000.0 |
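If a pattern should only ever apply to one column, you can also call replace on that column’s Series directly. This is just an alternative sketch of the same replacement:
# Scope the regex replacement to the Category column only
df['Category'] = df['Category'].replace(r'Sports with a \w+', 'Ball sports', regex=True)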
Using dictionaries to replace values on a Pandas dataframe.
Dictionaries give us a very straightforward way to define value mappings to replace. We can directly pass a dictionary with current values as keys and replacements as values.
The code below replaces ‘Music’ with ‘Performing Art’ and every ‘N/A’ with np.nan. The replacement happens across all columns, since we didn’t single out any particular one. Note that without regex=True the dictionary keys must match the whole cell value exactly, which is why ‘Music|Lifestyle’ stays untouched in the table below.
df.replace({'N/A': np.nan, 'Music': 'Performing Art'})
| Title | Category | Followers | Engagement |
|---|---|---|---|
| Cristiano Ronaldo | Sports with a ball | 400100000.0 | 7800000.0 |
| Kylie ???? | Fashion\|Modeling\|Beauty | NaN | 6200000.0 |
| Leo Messi | Sports with a ball\|Family | 306300000.0 | NaN |
| Kendall | Modeling\|Fashion | NaN | 3400000.0 |
| Selena Gomez | Music\|Lifestyle | 295800000.0 | 2700000.0 |
The power of dictionaries goes beyond a simple flat mapping. We can nest them to name a column first and then give the mapping for that column. This way, every column can have its own replacement rules.
df.replace({
'Engagement':{'N/A': np.nan},
'Category':{'Sports with a ball': 'Sports'},
})
| Title | Category | Followers | Engagement |
|---|---|---|---|
| Cristiano Ronaldo | Sports | 400100000.0 | 7800000.0 |
| Kylie ???? | Fashion\|Modeling\|Beauty | N/A | 6200000.0 |
| Leo Messi | Sports with a ball\|Family | 306300000.0 | NaN |
| Kendall | Modeling\|Fashion | N/A | 3400000.0 |
| Selena Gomez | Music\|Lifestyle | 295800000.0 | 2700000.0 |
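Since the outer keys are just column names, nothing stops you from building the nested mapping programmatically when many columns share the same rule; a small sketch:
# Apply the same 'N/A' -> NaN rule to a list of columns without repeating yourself
na_columns = ['Followers', 'Engagement']
df.replace({col: {'N/A': np.nan} for col in na_columns})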
Lastly, we can combine the merits of nested dictionaries with regex, specifying a different rule and replacement value for each column.
The code below does a regex-based string replacement on the ‘Category’ column, while on the ‘Followers’ column it assigns np.nan to the missing values.
df.replace({
'Category': {r'Sports with a \w+': 'Ball sports'},
'Followers': {'N/A': np.nan}
}, regex=True)
| Title | Category | Followers | Engagement |
|---|---|---|---|
| Cristiano Ronaldo | Ball sports | 400100000.0 | 7800000.0 |
| Kylie ???? | Fashion\|Modeling\|Beauty | NaN | 6200000.0 |
| Leo Messi | Ball sports\|Family | 306300000.0 | N/A |
| Kendall | Modeling\|Fashion | NaN | 3400000.0 |
| Selena Gomez | Music\|Lifestyle | 295800000.0 | 2700000.0 |
This method is both flexible and faster: the regex search happens only within the specified column instead of the entire dataset, which is particularly useful when the dataset contains a ton of records.
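To get a feel for the difference, here is a rough, illustrative benchmark sketch on a synthetic dataframe; the exact numbers will vary with your data and machine.
import time
import numpy as np
import pandas as pd

# Synthetic dataframe: one text column that needs regex work, plus others that don't
n = 1_000_000
big = pd.DataFrame({
    'Category': np.random.choice(['Sports with a ball', 'Music|Lifestyle'], n),
    'Followers': np.random.choice(['N/A', '1000000'], n),
    'Notes': np.random.choice(['foo', 'bar'], n),
})

start = time.perf_counter()
big.replace(r'Sports with a \w+', 'Ball sports', regex=True)  # regex runs over every column
print('whole dataframe:', time.perf_counter() - start)

start = time.perf_counter()
big.replace({'Category': {r'Sports with a \w+': 'Ball sports'}}, regex=True)  # regex scoped to one column
print('single column:', time.perf_counter() - start)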
Related: How to Speed up Python Data Pipelines up to 91X?
Thanks for the read, friend. It seems you and I have lots of common interests. Say Hi to me on LinkedIn, Twitter, and Medium. I’ll break the ice for you.
Not a Medium member yet? Please use this link to become a member, because I earn a commission for referring you, at no extra cost to you.