I have several tables that contain lab results, with a ‘master’ table of sample data with things like a description. The results tables are also broken down by specimen (sub-samples). They contain multiple results columns – I’m just showing one here. I want to combine all the results tables into one dataframe, like this:
Table 1: Location Sample Description 1 A Yellow 1 B Red 2 A Blue 2 B Violet Table 2 Location Sample Specimen Result1 1 A X 5 1 A Y 6 1 B X 10 2 A X 1 Table 3 Location Sample Specimen Result2 1 A X "Heavy" 1 A Q "Soft" 1 B K "Grey" 2 B Z "Bananas" Desired Output: Location Sample Description Specimen Result1 Result2 1 A Yellow X 5 "Heavy" 1 A Yellow Y 6 nan 1 A Yellow Q nan "Soft" 1 B Red X 10 nan 1 B Red K nan "Grey" 2 A Blue X 1 nan 2 B Violet Z nan "Bananas"
I currently have a solution for this using iterrows() and df.append(), but these are both slow operations and when there are thousands of results it takes too long. Is there better way? I have tried using join() and merge() but I can’t seem to get the result I want.
Quick code to reproduce my dataframes:
dict1 = {'Location': [1,1,2,2], 'Sample': ['A','B','A','B'], 'Description': ['Yellow','Red','Blue','Violet']}
dict2 = {'Location': [1,1,1,2], 'Sample': ['A','A','B','A'], 'Specimen': ['x', 'y','x', 'x'], 'Result1': [5,6,10,1]}
dict3 = {'Location': [1,1,1,2], 'Sample': ['A','A','B','B'], 'Specimen': ['x', 'q','k', 'z'], 'Result2': ["Heavy","Soft","Grey","Bananas"]}
df1 = pd.DataFrame.from_dict(dict1)
df2 = pd.DataFrame.from_dict(dict2)
df3 = pd.DataFrame.from_dict(dict3)
Advertisement
Answer
First idea is join df2, df3 together by concat and for unique 'Location','Sample','Specimen' rows are rows aggregated by sum, last merge to df1:
df23 = (pd.concat([df2, df3])
.groupby(['Location','Sample','Specimen'], as_index=False, sort=False)
.sum(min_count=1))
df = df1.merge(df23, on=['Location','Sample'])
print (df)
Location Sample Description Specimen Result1 Result2
0 1 A Yellow x 5.0 4.0
1 1 A Yellow y 6.0 NaN
2 1 A Yellow q NaN 6.0
3 1 B Red x 10.0 NaN
4 1 B Red k NaN 8.0
5 2 A Blue x 1.0 NaN
6 2 B Violet z NaN 5.0
Or if all rows in df2,df3 per columns ['Location','Sample','Specimen'] are unique, solution is simplier:
df23 = pd.concat([df2.set_index(['Location','Sample','Specimen']),
df3.set_index(['Location','Sample','Specimen'])], axis=1)
df = df1.merge(df23.reset_index(), on=['Location','Sample'])
print (df)
Location Sample Description Specimen Result1 Result2
0 1 A Yellow q NaN 6.0
1 1 A Yellow x 5.0 4.0
2 1 A Yellow y 6.0 NaN
3 1 B Red k NaN 8.0
4 1 B Red x 10.0 NaN
5 2 A Blue x 1.0 NaN
6 2 B Violet z NaN 5.0
EDIT: With new data second solution working well:
df23 = pd.concat([df2.set_index(['Location','Sample','Specimen']),
df3.set_index(['Location','Sample','Specimen'])], axis=1)
df = df1.merge(df23.reset_index(), on=['Location','Sample'])
print (df)
Location Sample Description Specimen Result1 Result2
0 1 A Yellow q NaN Soft
1 1 A Yellow x 5.0 Heavy
2 1 A Yellow y 6.0 NaN
3 1 B Red k NaN Grey
4 1 B Red x 10.0 NaN
5 2 A Blue x 1.0 NaN
6 2 B Violet z NaN Bananas