I have a table with some company information that we’re trying to clean up. In the first column is a clean company name, but not necessarily the correct one. In the second column, there is the correct company name, but often not very clean / missing. Here is an example.
Name | Info |
---|---|
Nike | Nike, a footwear manufacturer is headquartered in Oregon. |
ASG Shoes | Reebok |
Adidas | None |
We’re working with this dataset in Pandas. We’d like to follow the rules below.
- If the Name column is equal to the left side of the Info column, keep the name column. We would like this to be dynamic with the length of column 1. For “Nike”, it should check the first 4 letters of the Info column, for “ASG Shoes”, it should check the first 9 characters.
- If rule 1 is false, use the Info column.
- If Info is None, use the Name column.
The output we seek is a 3rd column that is the output of these rules. I am hoping someone can help me with writing this code in an efficient manner. There’s a lot going on here and I want to ensure I’m doing this properly. How can I achieve this output with the most efficient Python code possible?
Name | Info | Clean |
---|---|---|
Nike | Nike, a footwear manufacturer is headquartered in Oregon. | Nike |
ASG Shoes | Reebok | Reebok |
Adidas | None | Adidas |
Advertisement
Answer
You can start by creating another column that contains the length of your Name
column. This is really straight-forward. Let us call the new column Slicers
. What you can then do is to create a function that slices a string by a certain number and map this function to your columns Info
and Slicers
, where Info
is the string column that should be sliced and Slicers
defines the slicing number. (There may be even a pandas
implementation for this, but I do not know one). After that, you can compare your sliced info with your Name
variable and assign all matches to your Clean
column. Then, just apply a pandas coalesce
over your desired columns.
The code implementation is given below:
import pandas as pd def slicer(strings, slicers): return strings[:slicers] if isinstance(strings, str) else strings df = pd.DataFrame({ "Name": ["Nike", "ASG Shoes", "Adidas"], "Info": ["Nike, a footwear manufacturer is headquartered in Oregon.", "Reebok", None] }) # Define length column df["Slicers"] = df["Name"].str.len() # Slice Info column by length column and overwrite df["Slicers"] = list(map(slicer, df["Info"], df["Slicers"])) # Check whether sliced str column and name column are equal mask = df["Name"].eq(df["Slicers"]) # Overwrite if they are equal df.loc[mask, "Clean"] = df.loc[mask, "Name"] # Apply coalesce coalesce_rules = ["Clean", "Info", "Name"] df.drop(columns=["Slicers"]).assign(Clean=df[coalesce_rules].fillna(method="bfill", axis=1).iloc[:,0])
Output:
Name Info Clean 0 Nike Nike, a footwear manufacturer is headquartered... Nike 1 ASG Shoes Reebok Reebok 2 Adidas None Adidas
It only needs around five seconds for 3. Mio rows. Obviously, I do not know whether this is the most efficient way to solve your problem. But I think it’s an efficient one.