Skip to content
Advertisement

How do I coalesce Pandas columns only where the beginnings of the columns don’t match?

I have a table with some company information that we’re trying to clean up. In the first column is a clean company name, but not necessarily the correct one. In the second column, there is the correct company name, but often not very clean / missing. Here is an example.

Name Info
Nike Nike, a footwear manufacturer is headquartered in Oregon.
ASG Shoes Reebok
Adidas None

We’re working with this dataset in Pandas. We’d like to follow the rules below.

  1. If the Name column is equal to the left side of the Info column, keep the name column. We would like this to be dynamic with the length of column 1. For “Nike”, it should check the first 4 letters of the Info column, for “ASG Shoes”, it should check the first 9 characters.
  2. If rule 1 is false, use the Info column.
  3. If Info is None, use the Name column.

The output we seek is a 3rd column that is the output of these rules. I am hoping someone can help me with writing this code in an efficient manner. There’s a lot going on here and I want to ensure I’m doing this properly. How can I achieve this output with the most efficient Python code possible?

Name Info Clean
Nike Nike, a footwear manufacturer is headquartered in Oregon. Nike
ASG Shoes Reebok Reebok
Adidas None Adidas

Advertisement

Answer

You can start by creating another column that contains the length of your Name column. This is really straight-forward. Let us call the new column Slicers. What you can then do is to create a function that slices a string by a certain number and map this function to your columns Info and Slicers, where Info is the string column that should be sliced and Slicers defines the slicing number. (There may be even a pandas implementation for this, but I do not know one). After that, you can compare your sliced info with your Name variable and assign all matches to your Clean column. Then, just apply a pandas coalesce over your desired columns.

The code implementation is given below:

import pandas as pd

def slicer(strings, slicers):
    return strings[:slicers] if isinstance(strings, str) else strings

df = pd.DataFrame({
    "Name": ["Nike", "ASG Shoes", "Adidas"],
    "Info": ["Nike, a footwear manufacturer is headquartered in Oregon.", "Reebok", None] 
})
# Define length column
df["Slicers"] = df["Name"].str.len()
# Slice Info column by length column and overwrite
df["Slicers"] = list(map(slicer, df["Info"], df["Slicers"]))
# Check whether sliced str column and name column are equal
mask = df["Name"].eq(df["Slicers"])
# Overwrite if they are equal
df.loc[mask, "Clean"] = df.loc[mask, "Name"]

# Apply coalesce
coalesce_rules = ["Clean", "Info", "Name"]
df.drop(columns=["Slicers"]).assign(Clean=df[coalesce_rules].fillna(method="bfill", axis=1).iloc[:,0])

Output:

    Name       Info                                                Clean
0   Nike       Nike, a footwear manufacturer is headquartered...   Nike
1   ASG Shoes  Reebok                                              Reebok
2   Adidas     None                                                Adidas

It only needs around five seconds for 3. Mio rows. Obviously, I do not know whether this is the most efficient way to solve your problem. But I think it’s an efficient one.

User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement