pandas, merge duplicates if row contains wildcard text

Tags: , , ,



I have a dataset of duplicates (ID). Dataset contains both information and emails. I’m trying to concatenate the emails (if row have character @) and then remove the duplicates.

My original dataset: Data

What I wish to accomplish: Wish

My current code is a modification of Eric Ed Lohmar code and give the following output. My issue is that I’m not able to exclude “noise” data as I get: , nan , , 0 and -, in my final result.

Current Output: Current Output

How do I append rows with only email addresses? I thought that I could skip to append all rows that contain the character @, by using wildcard and replacing this part:

if row['Store1_Email']: # <- not working

to any of these attempts, but nothing is working:

1.

if str('**@**') in row['Store1_Email']: # <- not working

Error:

Traceback (most recent call last):
  File "g:/Till/till_duplicate.py", line 35, in <module>
    if str('**@**') in row['Store1_Email']:
TypeError: argument of type 'float' is not iterable
PS G:Till>

Error:

Traceback (most recent call last):
  File "g:/Till/till_duplicate.py", line 35, in <module>
    if df_merged_duplicates[df_merged_duplicates.loc[i, 'Store1_Email'].str.contains('@')]:
AttributeError: 'str' object has no attribute 'str'
PS G:Till>

Full Code:

import pandas as pd
import os
from datetime import datetime
import time
from shutil import copyfile
from functools import reduce
import numpy as np
import glob


# # Settings
path_data_sources = 'G:/Till/'

# print(path_data_sources + 'test_duplicates - Copy.xlsx')

## https://stackoverflow.com/questions/36271413/pandas-merge-nearly-duplicate-rows-based-on-column-value
# df_merged_duplicates = pd.read_excel(path_data_sources + 'test_duplicates - Source.xlsx', sheet_name="Sheet1", dtype=str)

data = {'ID':['001', '001', '002', '002', '003', '003', '004', '004', '005', '005', '006', '006', '007', '007', '008', '008', '009', '009', '010', '010', '011', '011', '012', '012', '013', '013', '014', '014'], 
    'Header 1':['AA', 'AA', 'BB', 'BB', 'CC', 'CC', 'DD', 'DD', 'EE', 'EE', 'FF', 'FF', 'GG', 'GG', 'HH', 'HH', 'II', 'II', 'JJ', 'JJ', 'KK', 'KK', 'LL', 'LL', 'MM', 'MM', 'NN', 'NN'], 
    'Header 2':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 
    'Header 3':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 
    'Header 4':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 
    'Header 5':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 
    'Store1_Email':['Email@company1.com',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,'Email@company2.com','Email@company2.com','Email@company3.com','Email@company3.com',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,'Email@company4.com','Email@company4.com',np.nan,np.nan,np.nan,np.nan], 
    'Header 7':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 
    'Header 8':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 
    'Header 9':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 
    'Store2_Email':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 
    'Header 11':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 
    'Header 12':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 
    'Store3_Email':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 
    'Header 14':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 
    'Header 15':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 
    'Header 16':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 
    'Header 17':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 
    'Store4_Email':['Email2@company2.com','0',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,'Email2@company3.com','-',np.nan,np.nan,'-','Email2@company4.com',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 
    'Header 19':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]} 
df_merged_duplicates = pd.DataFrame(data)

print(df_merged_duplicates)

df_merged_duplicates = df_merged_duplicates.sort_values(by=['ID']) # sort ID column

# Store 1 emails, merge
cmnts = {}
for i, row in df_merged_duplicates.iterrows():
    while True:
        try:
            if row['Store1_Email']: # <- not working
                cmnts[row['ID']].append(row['Store1_Email'])
            else:
                cmnts[row['ID']].append(np.nan)

            break

        except KeyError:
            cmnts[row['ID']] = []

# Store 2 emails, merge
cmnts2 = {}
for i, row in df_merged_duplicates.iterrows():
    while True:
        try:
            if row['Store2_Email']: # <- not working
                cmnts2[row['ID']].append(row['Store2_Email'])
            else:
                cmnts2[row['ID']].append(np.nan)

            break

        except KeyError:
            cmnts2[row['ID']] = []

# Store 3 emails, merge
cmnts3 = {}
for i, row in df_merged_duplicates.iterrows():
    while True:
        try:           
            if row['Store3_Email']: # <- not working
                cmnts3[row['ID']].append(row['Store3_Email'])
            else:
                cmnts3[row['ID']].append(np.nan)

            break

        except KeyError:
            cmnts3[row['ID']] = []

# Store 4 emails, merge
cmnts4 = {}
for i, row in df_merged_duplicates.iterrows():
    while True:
        try:           
            if row['Store4_Email']: # <- not working
                cmnts4[row['ID']].append(row['Store4_Email'])
            else:
                cmnts4[row['ID']].append(np.nan)

            break

        except KeyError:
            cmnts4[row['ID']] = []

df_merged_duplicates.drop_duplicates('ID', inplace=True)
df_merged_duplicates['Store1_Email'] = [', '.join(map(str, v)) for v in cmnts.values()]
df_merged_duplicates['Store2_Email'] = [', '.join(map(str, v)) for v in cmnts2.values()]
df_merged_duplicates['Store3_Email'] = [', '.join(map(str, v)) for v in cmnts3.values()]
df_merged_duplicates['Store4_Email'] = [', '.join(map(str, v)) for v in cmnts4.values()]



print(df_merged_duplicates)
df_merged_duplicates.to_excel(path_data_sources + 'test_duplicates_ny.xlsx', index=False)

Answer

I would use a “split-apply-combine” approach. In pandas you can use the groupby function to do this and then apply a function to combine the email addresses to each group (in this case you can group by the ID col.

I wrote a function to combine the email addresses for a given column:

def combine_emails(series):
    strs = [s for s in series.astype(str).values if '@' in s]
    combined_emails = ",".join(strs)
    if combined_emails !='':
        return combined_emails
    else:
        return np.nan

Then I wrote a function to take the first row of each grouped dataframe and call the combine function on the email columns to populate the row email values:

def combine_duplicate_rows(df):
    first_row = df.iloc[0]
    for email_col in ['Store1_Email', 'Store2_Email', 'Store3_Email', 'Store4_Email']:
        first_row[email_col] = combine_emails(df[email_col])
    return first_row

Then you can apply the combine_duplicate_rows to your groups and you get the solution:

In [71]: df.groupby('ID').apply(combine_duplicate_rows)
Out[71]:
    ID Header 1  Header 2  Header 3  Header 4  Header 5                           Store1_Email  Header 9  Store2_Email  Header 12  Store3_Email  Header 17         Store4_Email
ID
1    1       AA       NaN       NaN       NaN       NaN                     Email@company1.com       NaN           NaN        NaN           NaN        NaN  Email2@company2.com
2    2       BB       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaN
3    3       CC       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaN
4    4       DD       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaN
5    5       EE       NaN       NaN       NaN       NaN  Email@company2.com,Email@company2.com       NaN           NaN        NaN           NaN        NaN                  NaN
6    6       FF       NaN       NaN       NaN       NaN  Email@company3.com,Email@company3.com       NaN           NaN        NaN           NaN        NaN                  NaN
7    7       GG       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaN
8    8       HH       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaN
9    9       II       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN  Email2@company3.com
10  10       JJ       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaN
11  11       KK       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN  Email2@company4.com
12  12       LL       NaN       NaN       NaN       NaN  Email@company4.com,Email@company4.com       NaN           NaN        NaN           NaN        NaN                  NaN
13  13       MM       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaN
14  14       NN       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaN

You then have a duplicate ID column, but you can just delete that

del df['ID']


Source: stackoverflow