I would like to join two files by key but I would like some columns to be joined together for example:
File1:
df1 = pd.DataFrame({'List' : ['P111', 'P999', 'P111;P999;P777', 'P555', 'P666;P111;P333'],
'Color' : ['red', 'red', 'blue','yellow', 'red']})
List Color
0 P111 red
1 P999 red
2 P111;P999;P777 blue
3 P555 yellow
4 P666;P111;P333 red
File2:
df2 = pd.DataFrame({'Cod' : ['P111', 'P222', 'P333', 'P444', 'P555', 'P666', 'P777'],
'Animal' : ['DOG', 'CAT', 'BUG','SNAKE,DOG', 'CAT,BUG', 'DOG', 'SNAKE'],
'Letter' : ['A,F', 'C', 'S,M', 'F,L', 'C,A','M,C', 'Z,L']})
Cod Animal Letter
0 P111 DOG A,F
1 P222 CAT C
2 P333 BUG S,M
3 P444 SNAKE,DOG F,L
4 P555 CAT,BUG C,A
5 P666 DOG M,C
6 P777 SNAKE Z,L
I would like to merge with primary key List (from file1) and Cod (from file2), to get:
List Color Animal Letter 0 P111 red DOG,FROG A,F 1 P999 red - - 2 P111;P999;P777 blue DOG,FROG|-|SNAKE A,F|-|Z,L 3 P555 yellow CAT,BUG C,A 4 P666;P111;P333 red DOG|DOG,FROG|BUG M,C|A,F|S,M
I think we need something like a left join and an agragation but I don’t know how. In the final table I would like that in the values not found there was a – . While I would like a | to separate values in “aggregated” columns
Which is the best way?
Advertisement
Answer
Idea is use DataFrame.explode by splitted values of List, then use left join and aggregate first values with join for expected ouput:
df = (df1.assign(Cod = df1['List'].str.split(';'))
.explode('Cod')
.reset_index()
.merge(df2, how='left', on='Cod')
.fillna('-')
.groupby('index')
.agg(List=('List','first'),
Color=('Color','first'),
Animal=('Animal','|'.join),
Letter=('Letter','|'.join))
.rename_axis(None))
print (df)
List Color Animal Letter
0 P111 red DOG A,F
1 P999 red - -
2 P111;P999;P777 blue DOG|-|SNAKE A,F|-|Z,L
3 P555 yellow CAT,BUG C,A
4 P666;P111;P333 red DOG|DOG|BUG M,C|A,F|S,M