I would like to join two files by key but I would like some columns to be joined together for example:
File1:
df1 = pd.DataFrame({'List' : ['P111', 'P999', 'P111;P999;P777', 'P555', 'P666;P111;P333'], 'Color' : ['red', 'red', 'blue','yellow', 'red']}) List Color 0 P111 red 1 P999 red 2 P111;P999;P777 blue 3 P555 yellow 4 P666;P111;P333 red
File2:
df2 = pd.DataFrame({'Cod' : ['P111', 'P222', 'P333', 'P444', 'P555', 'P666', 'P777'], 'Animal' : ['DOG', 'CAT', 'BUG','SNAKE,DOG', 'CAT,BUG', 'DOG', 'SNAKE'], 'Letter' : ['A,F', 'C', 'S,M', 'F,L', 'C,A','M,C', 'Z,L']}) Cod Animal Letter 0 P111 DOG A,F 1 P222 CAT C 2 P333 BUG S,M 3 P444 SNAKE,DOG F,L 4 P555 CAT,BUG C,A 5 P666 DOG M,C 6 P777 SNAKE Z,L
I would like to merge with primary key List (from file1) and Cod (from file2), to get:
List Color Animal Letter 0 P111 red DOG,FROG A,F 1 P999 red - - 2 P111;P999;P777 blue DOG,FROG|-|SNAKE A,F|-|Z,L 3 P555 yellow CAT,BUG C,A 4 P666;P111;P333 red DOG|DOG,FROG|BUG M,C|A,F|S,M
I think we need something like a left join and an agragation but I don’t know how. In the final table I would like that in the values not found there was a – . While I would like a | to separate values in “aggregated” columns
Which is the best way?
Advertisement
Answer
Idea is use DataFrame.explode
by splitted values of List
, then use left join and aggregate first values with join
for expected ouput:
df = (df1.assign(Cod = df1['List'].str.split(';')) .explode('Cod') .reset_index() .merge(df2, how='left', on='Cod') .fillna('-') .groupby('index') .agg(List=('List','first'), Color=('Color','first'), Animal=('Animal','|'.join), Letter=('Letter','|'.join)) .rename_axis(None)) print (df) List Color Animal Letter 0 P111 red DOG A,F 1 P999 red - - 2 P111;P999;P777 blue DOG|-|SNAKE A,F|-|Z,L 3 P555 yellow CAT,BUG C,A 4 P666;P111;P333 red DOG|DOG|BUG M,C|A,F|S,M