Skip to content
Advertisement

Join two data frames by searching & matching exactly same strings

I have two python dataframes: one of them have a column has a row ‘AC-2’ another data frame includes columns which has a string row ‘AC-20,AC8,AC-15’

str_match = "({})".format("|".join(df1['column_first_dataframe']))
df2.merge(df2,how='left',left_on=df1['column_first_dataframe'].str.extract(str_match)[0], right_on="column_second_dataframe")

First dataset:

Column1 Column2
AC-2         2       
AC-20        1       
AC-15        3       
AC-1         2       
AC-5         5      
     

second dataset:

Column1        
AC-2,AC-5,AC-30      
AC-20,AC-30,AC11

I found:

Column1 Column2             
AC-2       AC-20,AC-30,AC11       
AC-2       AC-2,AC-5,AC-30   
AC-20      AC-20,AC-30,AC11        
AC-15        null      
AC-1         null      
AC-5        AC-2,AC-5,AC-30 

above there is a matching between AC-2 for dataset1 and AC-20 in string for dataset 2 but they are different thing in my dataset. my desired output:

Column1 Column2             
AC-2       AC-2,AC-5,AC-30      
AC-20      AC-20,AC-30,AC11        
AC-15        null      
AC-1         null      
AC-5       AC-2,AC-5,AC-30

I tried to merge data frames with searching and matching BUT my code merged AC-2 to AC-20. I don’t want to do that. Are there any searching and matching way to merge my data frames exactly what I want. I don’t want to merge AC-20 and AC-2. It should merge AC-2 and AC-2 with exactly same thing.

Thank you for your contributions!

Advertisement

Answer

A simple way is to split and explode the Column2 of df2 to get one row per full word and perform a simple left merge.

(df1
 .merge(df2.assign(group=df2['Column1'].str.split(','))
           .rename(columns={'Column1': 'Column3'})
           .explode('group'),
        left_on='Column1', right_on='group', how='left'
       )
 .drop(columns='group')
)

output:

  Column1  Column2           Column3
0    AC-2        2   AC-2,AC-5,AC-30
1   AC-20        1  AC-20,AC-30,AC11
2   AC-15        3               NaN
3    AC-1        2               NaN
4    AC-5        5   AC-2,AC-5,AC-30
alternative

NB. I am producing above a slightly different output to keep all columns. If really you want your output, simply drop “Column2” in df1 and rename “Column1” into “Column2” in df2

(df1
 .drop(columns='Column2')
 .merge(df2.rename(columns={'Column1': 'Column2'})
           .assign(Column1=df2['Column1'].str.split(','))
           .explode('Column1'),
        on='Column1', how='left'
       )
)

output:

  Column1           Column2
0    AC-2   AC-2,AC-5,AC-30
1   AC-20  AC-20,AC-30,AC11
2   AC-15               NaN
3    AC-1               NaN
4    AC-5   AC-2,AC-5,AC-30
User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement