Pandas Replace: The Faster and Better Approach to Changing Column Values.
Replacing values in a dataframe can be surprisingly tricky, and bulk replacement across a large dataset can be slow and error-prone.
Yet Pandas is flexible enough to handle it well.
You can postpone the idea of moving to a distributed computing system; value replacement alone isn't a reason to make the switch.
Related: Is Your Python For-loop Slow? Use NumPy Instead
Replacing column values: The old approach.
The popular approach to replacing a value is locating the cell (or the column) and assigning it a new value.
Here’s how people do it (and how I did it when I was starting out in data analysis).
| Title | Category | Followers | Engagement |
|---|---|---|---|
| Cristiano Ronaldo | Sports with a ball | 400100000.0 | 7800000.0 |
| Kylie ???? | Fashion\|Modeling\|Beauty | N/A | 6200000.0 |
| Leo Messi | Sports with a ball\|Family | 306300000.0 | N/A |
| Kendall | Modeling\|Fashion | N/A | 3400000.0 |
| Selena Gomez | Music\|Lifestyle | 295800000.0 | 2700000.0 |
The above is a dataset of top Instagram influencers. You can download the original dataset from Kaggle. Let’s suppose we’ve noticed that the dataset has “N/A” in place of some missing values.
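If you want to follow along, here is a minimal setup sketch. The file name below is a placeholder, and depending on the exact Kaggle file you may need to rename or trim columns to match the ones shown above.
import numpy as np
import pandas as pd

# Placeholder file name; point this at the CSV you downloaded from Kaggle
df = pd.read_csv('top_influencers.csv')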
People will do this.
df.loc[df.Followers == 'N/A', 'Followers'] = np.nan
df.loc[df.Engagement == 'N/A', 'Engagement'] = np.nan
As we can see, Pandas has to build a boolean mask for each condition and then assign the new values, one column at a time. This isn’t optimal.
This method may seem to give us more control over the dataset. But on large datasets, or when each column has its own replacement criteria, it quickly becomes painstaking.
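To see how quickly it gets repetitive, here is a sketch of the same old approach generalized to several columns; the column list is just an assumption about which columns in your copy contain the “N/A” placeholder.
# One mask and one assignment per column; every new rule means another line or another loop
for col in ['Followers', 'Engagement']:
    df.loc[df[col] == 'N/A', col] = np.nan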
Related: A Better Way to Summarize Pandas Dataframes.
Using the replace API on a Pandas dataframe
The Pandas library has an incredible API called replace. This article covers the beauty of this API across different use cases.
Here’s how we could rework the above example.
df.replace(to_replace='N/A', value=np.nan)
# Or, more concisely, with positional arguments
df.replace('N/A', np.nan)
With this method, Pandas scans through the dataframe only once and replaces the values as it goes.
Both calls produce the same result, shown in the following table.
| Title | Category | Followers | Engagement |
|---|---|---|---|
| Cristiano Ronaldo | Sports with a ball | 400100000.0 | 7800000.0 |
| Kylie ???? | Fashion\|Modeling\|Beauty | NaN | 6200000.0 |
| Leo Messi | Sports with a ball\|Family | 306300000.0 | NaN |
| Kendall | Modeling\|Fashion | NaN | 3400000.0 |
| Selena Gomez | Music\|Lifestyle | 295800000.0 | 2700000.0 |
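One thing to keep in mind: replace returns a new dataframe and leaves the original untouched, so assign the result back (or pass inplace=True) if you want to keep the change. A minimal sketch:
# replace does not modify df in place by default
df = df.replace('N/A', np.nan)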
Not to mention, the replace method is more elegant. Someone else reading your code will likely find this syntax more readable than the previous one.
The replace API has more advanced ways to change column values. Let’s have a look at them one by one.
Regex-based value replacement on a Pandas dataframe.
Regular expressions let us replace values that match a dynamic pattern rather than an exact string. Replacing such values in a massive dataset would be challenging without regex.
Suppose we need to replace the string “Sports with a \<anything>” with “Ball sports.” This isn’t an exact match we can handle the usual way, so a regex replacement is ideal here.
The following replace call does the trick, since we’ve written a regex pattern that matches our criteria. Notice that we pass regex=True in the call.
df.replace(r'Sports with a \w+', 'Ball sports', regex=True)
The resulting table would look like this:
| Title | Category | Followers | Engagement |
|---|---|---|---|
| Cristiano Ronaldo | Ball sports | 400100000.0 | 7800000.0 |
| Kylie ???? | Fashion\|Modeling\|Beauty | N/A | 6200000.0 |
| Leo Messi | Ball sports\|Family | 306300000.0 | N/A |
| Kendall | Modeling\|Fashion | N/A | 3400000.0 |
| Selena Gomez | Music\|Lifestyle | 295800000.0 | 2700000.0 |
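If a pattern should only ever apply to one column, you can also call replace on that column’s Series directly. This is just an alternative sketch of the same replacement:
# Scope the regex replacement to the Category column only
df['Category'] = df['Category'].replace(r'Sports with a \w+', 'Ball sports', regex=True)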
Using dictionaries to replace values on a Pandas dataframe.
Dictionaries give us a very straightforward way to define value mappings to replace. We can directly pass a dictionary with current values as keys and replacements as values.
The code below replaces ‘Music’ with ‘Performing Art’ and every ‘N/A’ with np.nan. The replacement happens across all columns, since we didn’t single out any particular one. Note that without regex=True the dictionary keys must match the whole cell value exactly, which is why ‘Music|Lifestyle’ stays untouched in the table below.
df.replace({'N/A': np.nan, 'Music': 'Performing Art'})
| Title | Category | Followers | Engagement |
|---|---|---|---|
| Cristiano Ronaldo | Sports with a ball | 400100000.0 | 7800000.0 |
| Kylie ???? | Fashion\|Modeling\|Beauty | NaN | 6200000.0 |
| Leo Messi | Sports with a ball\|Family | 306300000.0 | NaN |
| Kendall | Modeling\|Fashion | NaN | 3400000.0 |
| Selena Gomez | Music\|Lifestyle | 295800000.0 | 2700000.0 |
The power of dictionaries goes beyond a simple flat mapping. We can nest them to name a column first and then give the mapping for that column. This way, every column can have its own replacement rules.
df.replace({
'Engagement':{'N/A': np.nan},
'Category':{'Sports with a ball': 'Sports'},
})
| Title | Category | Followers | Engagement |
|---|---|---|---|
| Cristiano Ronaldo | Sports | 400100000.0 | 7800000.0 |
| Kylie ???? | Fashion\|Modeling\|Beauty | N/A | 6200000.0 |
| Leo Messi | Sports with a ball\|Family | 306300000.0 | NaN |
| Kendall | Modeling\|Fashion | N/A | 3400000.0 |
| Selena Gomez | Music\|Lifestyle | 295800000.0 | 2700000.0 |
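Since the outer keys are just column names, nothing stops you from building the nested mapping programmatically when many columns share the same rule; a small sketch:
# Apply the same 'N/A' -> NaN rule to a list of columns without repeating yourself
na_columns = ['Followers', 'Engagement']
df.replace({col: {'N/A': np.nan} for col in na_columns})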
Lastly, we can combine the merits of nested dictionaries with regex, specifying a different rule and replacement value for each column.
The code below does a regex-based string replacement on the ‘Category’ column, while on the ‘Followers’ column it assigns np.nan to the missing values.
df.replace({
'Category': {r'Sports with a \w+': 'Ball sports'},
'Followers': {'N/A': np.nan}
}, regex=True)
| Title | Category | Followers | Engagement |
|---|---|---|---|
| Cristiano Ronaldo | Ball sports | 400100000.0 | 7800000.0 |
| Kylie ???? | Fashion\|Modeling\|Beauty | NaN | 6200000.0 |
| Leo Messi | Ball sports\|Family | 306300000.0 | N/A |
| Kendall | Modeling\|Fashion | NaN | 3400000.0 |
| Selena Gomez | Music\|Lifestyle | 295800000.0 | 2700000.0 |
This method is both flexible and faster: the regex search happens only within the specified column instead of the entire dataset, which is particularly useful when the dataset contains a ton of records.
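To get a feel for the difference, here is a rough, illustrative benchmark sketch on a synthetic dataframe; the exact numbers will vary with your data and machine.
import time
import numpy as np
import pandas as pd

# Synthetic dataframe: one text column that needs regex work, plus others that don't
n = 1_000_000
big = pd.DataFrame({
    'Category': np.random.choice(['Sports with a ball', 'Music|Lifestyle'], n),
    'Followers': np.random.choice(['N/A', '1000000'], n),
    'Notes': np.random.choice(['foo', 'bar'], n),
})

start = time.perf_counter()
big.replace(r'Sports with a \w+', 'Ball sports', regex=True)  # regex runs over every column
print('whole dataframe:', time.perf_counter() - start)

start = time.perf_counter()
big.replace({'Category': {r'Sports with a \w+': 'Ball sports'}}, regex=True)  # regex scoped to one column
print('single column:', time.perf_counter() - start)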
Related: How to Speed up Python Data Pipelines up to 91X?
Thanks for the read, friend. It seems you and I have lots of common interests. Say Hi to me on LinkedIn, Twitter, and Medium. I’ll break the ice for you.
Not a Medium member yet? Please use this link to become a member, because I earn a commission for referring you, at no extra cost to you.