I have several tables that contain lab results, with a ‘master’ table of sample data with things like a description. The results tables are also broken down by specimen (sub-samples). They contain multiple results columns – I’m just showing one here. I want to combine all the results tables into one dataframe, like this:
Table 1:
Location Sample Description
1 A Yellow
1 B Red
2 A Blue
2 B Violet
Table 2
Location Sample Specimen Result1
1 A X 5
1 A Y 6
1 B X 10
2 A X 1
Table 3
Location Sample Specimen Result2
1 A X "Heavy"
1 A Q "Soft"
1 B K "Grey"
2 B Z "Bananas"
Desired Output:
Location Sample Description Specimen Result1 Result2
1 A Yellow X 5 "Heavy"
1 A Yellow Y 6 nan
1 A Yellow Q nan "Soft"
1 B Red X 10 nan
1 B Red K nan "Grey"
2 A Blue X 1 nan
2 B Violet Z nan "Bananas"
I currently have a solution for this using iterrows()
and df.append()
, but these are both slow operations and when there are thousands of results it takes too long. Is there better way? I have tried using join()
and merge()
but I can’t seem to get the result I want.
Quick code to reproduce my dataframes:
dict1 = {'Location': [1,1,2,2], 'Sample': ['A','B','A','B'], 'Description': ['Yellow','Red','Blue','Violet']}
dict2 = {'Location': [1,1,1,2], 'Sample': ['A','A','B','A'], 'Specimen': ['x', 'y','x', 'x'], 'Result1': [5,6,10,1]}
dict3 = {'Location': [1,1,1,2], 'Sample': ['A','A','B','B'], 'Specimen': ['x', 'q','k', 'z'], 'Result2': ["Heavy","Soft","Grey","Bananas"]}
df1 = pd.DataFrame.from_dict(dict1)
df2 = pd.DataFrame.from_dict(dict2)
df3 = pd.DataFrame.from_dict(dict3)
Advertisement
Answer
First idea is join df2, df3
together by concat
and for unique 'Location','Sample','Specimen'
rows are rows aggregated by sum
, last merge to df1
:
df23 = (pd.concat([df2, df3])
.groupby(['Location','Sample','Specimen'], as_index=False, sort=False)
.sum(min_count=1))
df = df1.merge(df23, on=['Location','Sample'])
print (df)
Location Sample Description Specimen Result1 Result2
0 1 A Yellow x 5.0 4.0
1 1 A Yellow y 6.0 NaN
2 1 A Yellow q NaN 6.0
3 1 B Red x 10.0 NaN
4 1 B Red k NaN 8.0
5 2 A Blue x 1.0 NaN
6 2 B Violet z NaN 5.0
Or if all rows in df2,df3
per columns ['Location','Sample','Specimen']
are unique, solution is simplier:
df23 = pd.concat([df2.set_index(['Location','Sample','Specimen']),
df3.set_index(['Location','Sample','Specimen'])], axis=1)
df = df1.merge(df23.reset_index(), on=['Location','Sample'])
print (df)
Location Sample Description Specimen Result1 Result2
0 1 A Yellow q NaN 6.0
1 1 A Yellow x 5.0 4.0
2 1 A Yellow y 6.0 NaN
3 1 B Red k NaN 8.0
4 1 B Red x 10.0 NaN
5 2 A Blue x 1.0 NaN
6 2 B Violet z NaN 5.0
EDIT: With new data second solution working well:
df23 = pd.concat([df2.set_index(['Location','Sample','Specimen']),
df3.set_index(['Location','Sample','Specimen'])], axis=1)
df = df1.merge(df23.reset_index(), on=['Location','Sample'])
print (df)
Location Sample Description Specimen Result1 Result2
0 1 A Yellow q NaN Soft
1 1 A Yellow x 5.0 Heavy
2 1 A Yellow y 6.0 NaN
3 1 B Red k NaN Grey
4 1 B Red x 10.0 NaN
5 2 A Blue x 1.0 NaN
6 2 B Violet z NaN Bananas