Skip to content
Advertisement

How do I coalesce Pandas columns only where the beginnings of the columns don’t match?

I have a table with some company information that we’re trying to clean up. In the first column is a clean company name, but not necessarily the correct one. In the second column, there is the correct company name, but often not very clean / missing. Here is an example.

Name Info
Nike Nike, a footwear manufacturer is headquartered in Oregon.
ASG Shoes Reebok
Adidas None

We’re working with this dataset in Pandas. We’d like to follow the rules below.

  1. If the Name column is equal to the left side of the Info column, keep the name column. We would like this to be dynamic with the length of column 1. For “Nike”, it should check the first 4 letters of the Info column, for “ASG Shoes”, it should check the first 9 characters.
  2. If rule 1 is false, use the Info column.
  3. If Info is None, use the Name column.

The output we seek is a 3rd column that is the output of these rules. I am hoping someone can help me with writing this code in an efficient manner. There’s a lot going on here and I want to ensure I’m doing this properly. How can I achieve this output with the most efficient Python code possible?

Name Info Clean
Nike Nike, a footwear manufacturer is headquartered in Oregon. Nike
ASG Shoes Reebok Reebok
Adidas None Adidas

Advertisement

Answer

You can start by creating another column that contains the length of your Name column. This is really straight-forward. Let us call the new column Slicers. What you can then do is to create a function that slices a string by a certain number and map this function to your columns Info and Slicers, where Info is the string column that should be sliced and Slicers defines the slicing number. (There may be even a pandas implementation for this, but I do not know one). After that, you can compare your sliced info with your Name variable and assign all matches to your Clean column. Then, just apply a pandas coalesce over your desired columns.

The code implementation is given below:

JavaScript

Output:

JavaScript

It only needs around five seconds for 3. Mio rows. Obviously, I do not know whether this is the most efficient way to solve your problem. But I think it’s an efficient one.

User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement