pandas, merge duplicates if row contains wildcard text

Question

I have a dataset of duplicates (ID). Dataset contains both information and emails. I&#8217;m trying to concatenate the emails (if row have character @) and then remove the duplicates. My original dataset: What I wish to accomplish: My current code is a modification of Eric Ed Lohmar code and give the followin…

Accepted Answer

I would use a &#8220;split-apply-combine&#8221; approach. In pandas you can use the groupby function to do this and then apply a function to combine the email addresses to each group (in this case you can group by the ID col.I wrote a function to combine the email addresses for a given column:def combine_emails(series):    strs = [s for s in series.astype(str).values if '@' in s]    combined_emails = ",".join(strs)    if combined_emails !='':        return combined_emails    else:        return np.nanThen I wrote a function to take the first row of each grouped dataframe and call the combine function on the email columns to populate the row email values:def combine_duplicate_rows(df):    first_row = df.iloc[0]    for email_col in ['Store1_Email', 'Store2_Email', 'Store3_Email', 'Store4_Email']:        first_row[email_col] = combine_emails(df[email_col])    return first_rowThen you can apply the combine_duplicate_rows to your groups and you get the solution:In [71]: df.groupby('ID').apply(combine_duplicate_rows)Out[71]:    ID Header 1  Header 2  Header 3  Header 4  Header 5                           Store1_Email  Header 9  Store2_Email  Header 12  Store3_Email  Header 17         Store4_EmailID1    1       AA       NaN       NaN       NaN       NaN                     Email@company1.com       NaN           NaN        NaN           NaN        NaN  Email2@company2.com2    2       BB       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaN3    3       CC       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaN4    4       DD       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaN5    5       EE       NaN       NaN       NaN       NaN  Email@company2.com,Email@company2.com       NaN           NaN        NaN           NaN        NaN                  NaN6    6       FF       NaN       NaN       NaN       NaN  Email@company3.com,Email@company3.com       NaN           NaN        NaN           NaN        NaN                  NaN7    7       GG       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaN8    8       HH       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaN9    9       II       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN  Email2@company3.com10  10       JJ       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaN11  11       KK       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN  Email2@company4.com12  12       LL       NaN       NaN       NaN       NaN  Email@company4.com,Email@company4.com       NaN           NaN        NaN           NaN        NaN                  NaN13  13       MM       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaN14  14       NN       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaNYou then have a duplicate ID column, but you can just delete thatdel df['ID']

Advertisement

Answer