Pandas Replace: The Faster and Better Approach to Change Values of a Column. | 1
|

Pandas Replace: The Faster and Better Approach to Change Values of a Column.

Replacing values on a dataframe can sometimes be very tricky. Bulk replacement in a large dataset could be difficult and slow.

Yet, Pandas is flexible enough to do it better.

You may have to postpone the idea of moving to a distributed computing system. At least not to solve the value replacement problems.

Related: Is Your Python For-loop Slow? Use NumPy Instead

Replacing column values: The old approach.

The popular approach to replacing a value is locating the cell (or the column) and assigning it a new value.

Here’s how people do it (and how I did it when starting data analysis.)

TitleCategoryFollowersEngagement
Cristiano RonaldoSports with a ball400100000.07800000.0
Kylie ????Fashion|Modeling|BeautyN/A6200000.0
Leo MessiSports with a ball|Family306300000.0N/A
KendallModeling|FashionN/A3400000.0
Selena GomezMusic|Lifestyle295800000.02700000.0

The above is a dataset of top Instagram influencers. You can download the original dataset from Kaggle. Let’s suppose we’ve noticed that the dataset has “N/A” in place of some missing values.

People will do this.

df.loc[df.Followers == 'N/A', 'Followers'] = np.nan
df.loc[df.Engagement == 'N/A', 'Engagement'] = np.nan
Python

As we can see, Pandas will have to create an index that satisfies our conditions and then replace them. This isn’t very optimal.

It may seem this method gives us more control over the dataset. But in large datasets, or when we have complex criteria for each column, this could be painstaking.

Related: A Better Way to Summarize Pandas Dataframes.

Using the replace API on a Pandas dataframe

Pandas library has an incredible API called replace. This article covers the beauty of this API in different use cases.

Here’s how we could rework the above example.

df.replace(to_replace='N/A', value=np.nan)

# Or more implicitly
df.replace('N/A', np.nan)
Python

Pandas scan through the dataframe only once in the new method and replace the values as it checks.

Both results are in the following table.

TitleCategoryFollowersEngagement
Cristiano RonaldoSports with a ball400100000.07800000.0
Kylie ????Fashion|Modeling|BeautyNaN6200000.0
Leo MessiSports with a ball|Family306300000.0NaN
KendallModeling|FashionNaN3400000.0
Selena GomezMusic|Lifestyle295800000.02700000.0

Not to mention, the Replace method is more elegant. A different person can find this syntax more readable than the previous one.

The replace API has more advanced ways to change column values. Let’s have a look at them one-by-one.

Regex-based value replacement on a Pandas dataframe.

Regular expressions are a faster way to replace values with a dynamic selection. It would be challenging to replace these values on a massive dataset if it were not using Regex.

Suppose we need to replace the string “Sports with a \<anything>” with “Ball sports.” This isn’t an exact match we can do the usual way. A regex replacement is ideal for this.

The following replaces API call would do the trick as we’ve set the regex expression to match our criteria. Notice that we’ve set regex=True when we’re calling it.

df.replace(r'Sports with a \w+', 'Ball sports', regex=True)
Python

The resulting table would look like this:

TitleCategoryFollowersEngagement
Cristiano RonaldoBall sports400100000.07800000.0
Kylie ????Fashion|Modeling|BeautyN/A6200000.0
Leo MessiBall sports|Family306300000.0N/A
KendallModeling|FashionN/A3400000.0
Selena GomezMusic|Lifestyle295800000.02700000.0

Using dictionaries to replace values on a Pandas dataframe.

Dictionaries give us a very straightforward way to define value mappings to replace. We can directly pass a dictionary with current values as keys and replacements as values.

The below code will replace ‘Music’ with ‘Performing Art’ and ‘N/A’s with np.nan. Replacement happens in all columns, as we didn’t specify a column name in particular.

df.replace({'N/A': np.nan, 'Music': 'Performing Art'})
Python
TitleCategoryFollowersEngagement
Cristiano RonaldoSports with a ball400100000.07800000.0
Kylie ????Fashion|Modeling|BeautyNaN6200000.0
Leo MessiSports with a ball|Family306300000.0NaN
KendallModeling|FashionNaN3400000.0
Selena GomezMusic|Lifestyle295800000.02700000.0

The power of dictionaries in replacing values is more than simply giving a mapping. We can have nested dictionaries to specify a column first and then map. This way, we can specify different replacement mappings for every column.

df.replace({
    'Engagement':{'N/A': np.nan}, 
    'Category':{'Sports with a ball': 'Sports'},
})
Python
TitleCategoryFollowersEngagement
Cristiano RonaldoSports400100000.07800000.0
Kylie ????Fashion|Modeling|BeautyN/A6200000.0
Leo MessiSports with a ball|Family306300000.0NaN
KendallModeling|FashionN/A3400000.0
Selena GomezMusic|Lifestyle295800000.02700000.0

Lastly, we can combine the merits of nested dictionaries with regex. We can specify different rules for each column and replace them with new values.

The below code will do a regex-based string replacement on column ‘category’. Yet, on the ‘followers’ column it assigns np.nan to the missing values.

df.replace({
    'Category': {r'Sports with a \w+': 'Ball sports'},
    'Followers': {'N/A': np.nan}
}, regex=True)
Python
TitleCategoryFollowersEngagement
Cristiano RonaldoBall sports400100000.07800000.0
Kylie ????Fashion|Modeling|BeautyNaN6200000.0
Leo MessiBall sports|Family306300000.0N/A
KendallModeling|FashionNaN3400000.0
Selena GomezMusic|Lifestyle295800000.02700000.0

This method is very much flexible and faster. The regex search happens only within one column instead of the entire dataset. This is particularly useful if the dataset contains a ton of records.

Related: How to Speed up Python Data Pipelines up to 91X?


Thanks for the read, friend. It seems you and I have lots of common interests. Say Hi to me on LinkedIn, Twitter, and Medium. I’ll break the ice for you.

Not a Medium member yet? Please use this link to become a member because I earn a commission for referring at no extra cost for you.

Similar Posts