Skip to content
Advertisement

pandas, merge duplicates if row contains wildcard text

I have a dataset of duplicates (ID). Dataset contains both information and emails. I’m trying to concatenate the emails (if row have character @) and then remove the duplicates.

My original dataset: Data

What I wish to accomplish: Wish

My current code is a modification of Eric Ed Lohmar code and give the following output. My issue is that I’m not able to exclude “noise” data as I get: , nan , , 0 and -, in my final result.

Current Output: Current Output

How do I append rows with only email addresses? I thought that I could skip to append all rows that contain the character @, by using wildcard and replacing this part:

JavaScript

to any of these attempts, but nothing is working:

1.

JavaScript

Error:

JavaScript

Error:

JavaScript

Full Code:

JavaScript

Advertisement

Answer

I would use a “split-apply-combine” approach. In pandas you can use the groupby function to do this and then apply a function to combine the email addresses to each group (in this case you can group by the ID col.

I wrote a function to combine the email addresses for a given column:

JavaScript

Then I wrote a function to take the first row of each grouped dataframe and call the combine function on the email columns to populate the row email values:

JavaScript

Then you can apply the combine_duplicate_rows to your groups and you get the solution:

JavaScript

You then have a duplicate ID column, but you can just delete that

JavaScript
User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement