How do I coalesce Pandas columns only where the beginnings of the columns don’t match?

Question

I have a table with some company information that we're trying to clean up. In the first column is a clean company name, but not necessarily the correct one. In the second column, there is the correct company name, but often not very clean / missing. Here is an example:

Name	Info
Nike	Nike, a footwear manufacturer is headquartered in

Accepted Answer

You can start by creating another column that contains the length of each value in your Name column. This is really straightforward. Let us call the new column Slicers. Then write a function that slices a string to a given length and map it over your Info and Slicers columns, where Info is the string column to be sliced and Slicers holds the slice lengths. (There may even be a built-in pandas method for this, but I do not know of one.) After that, compare the sliced Info with Name and assign all matches to your Clean column. Then just apply a pandas coalesce over your desired columns.

The code implementation is given below:

    import pandas as pd

    def slicer(strings, slicers):
        # Slice only real strings; pass missing values (None/NaN) through unchanged
        return strings[:slicers] if isinstance(strings, str) else strings

    df = pd.DataFrame({
        "Name": ["Nike", "ASG Shoes", "Adidas"],
        "Info": ["Nike, a footwear manufacturer is headquartered in Oregon.", "Reebok", None]
    })

    # Define length column
    df["Slicers"] = df["Name"].str.len()

    # Slice Info column by length column and overwrite
    df["Slicers"] = list(map(slicer, df["Info"], df["Slicers"]))

    # Check whether sliced str column and name column are equal
    mask = df["Name"].eq(df["Slicers"])

    # Overwrite if they are equal
    df.loc[mask, "Clean"] = df.loc[mask, "Name"]

    # Apply coalesce: back-fill across the columns and keep the first one
    coalesce_rules = ["Clean", "Info", "Name"]
    df.drop(columns=["Slicers"]).assign(Clean=df[coalesce_rules].bfill(axis=1).iloc[:, 0])

Output:

       Name       Info                                               Clean
    0  Nike       Nike, a footwear manufacturer is headquartered...  Nike
    1  ASG Shoes  Reebok                                             Reebok
    2  Adidas     None                                               Adidas

It takes only around five seconds for 3 million rows. Obviously, I do not know whether this is the most efficient way to solve your problem, but I think it's an efficient one.
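The slice-and-compare step described above amounts to asking whether Info starts with Name. This can also be written without the helper column. What follows is only a minimal sketch of that alternative, not the answer's own method. The function name coalesce_on_prefix and its parameters are made up for illustration, and it assumes that a missing Info should fall back to Name, as in the output shown in the answer.

```python
import pandas as pd


def coalesce_on_prefix(df, name_col="Name", info_col="Info"):
    """Hypothetical helper: Name where Info starts with Name, else Info, else Name."""
    # Row-wise prefix test; non-string values (None/NaN) never count as a match.
    starts_with_name = pd.Series(
        [
            isinstance(name, str) and isinstance(info, str) and info.startswith(name)
            for name, info in zip(df[name_col], df[info_col])
        ],
        index=df.index,
    )
    # Where Info is missing, fall back to Name (assumed coalesce order).
    fallback = df[info_col].fillna(df[name_col])
    # Keep Name on a prefix match, otherwise use the fallback value.
    return df[name_col].where(starts_with_name, fallback)


df = pd.DataFrame({
    "Name": ["Nike", "ASG Shoes", "Adidas"],
    "Info": ["Nike, a footwear manufacturer is headquartered in Oregon.", "Reebok", None],
})
df["Clean"] = coalesce_on_prefix(df)
print(df["Clean"].tolist())  # ['Nike', 'Reebok', 'Adidas']
```

The prefix test runs in a plain Python loop, so it is not necessarily faster than the mapped slicer. It only avoids the temporary Slicers column and the separate back-fill step.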

Name	Info
Nike	Nike, a footwear manufacturer is headquartered in Oregon.
ASG Shoes	Reebok
Adidas	None

Answer