I have several tables that contain lab results, with a ‘master’ table of sample data with things like a description. The results tables are also broken down by specimen (sub-samples). They contain multiple results columns – I’m just showing one here. I want to combine all the results tables into one dataframe, like this:
Table 1: Location Sample Description 1 A Yellow 1 B Red 2 A Blue 2 B Violet Table 2 Location Sample Specimen Result1 1 A X 5 1 A Y 6 1 B X 10 2 A X 1 Table 3 Location Sample Specimen Result2 1 A X "Heavy" 1 A Q "Soft" 1 B K "Grey" 2 B Z "Bananas" Desired Output: Location Sample Description Specimen Result1 Result2 1 A Yellow X 5 "Heavy" 1 A Yellow Y 6 nan 1 A Yellow Q nan "Soft" 1 B Red X 10 nan 1 B Red K nan "Grey" 2 A Blue X 1 nan 2 B Violet Z nan "Bananas"
I currently have a solution for this using iterrows()
and df.append()
, but these are both slow operations and when there are thousands of results it takes too long. Is there better way? I have tried using join()
and merge()
but I can’t seem to get the result I want.
Quick code to reproduce my dataframes:
dict1 = {'Location': [1,1,2,2], 'Sample': ['A','B','A','B'], 'Description': ['Yellow','Red','Blue','Violet']} dict2 = {'Location': [1,1,1,2], 'Sample': ['A','A','B','A'], 'Specimen': ['x', 'y','x', 'x'], 'Result1': [5,6,10,1]} dict3 = {'Location': [1,1,1,2], 'Sample': ['A','A','B','B'], 'Specimen': ['x', 'q','k', 'z'], 'Result2': ["Heavy","Soft","Grey","Bananas"]} df1 = pd.DataFrame.from_dict(dict1) df2 = pd.DataFrame.from_dict(dict2) df3 = pd.DataFrame.from_dict(dict3)
Advertisement
Answer
First idea is join df2, df3
together by concat
and for unique 'Location','Sample','Specimen'
rows are rows aggregated by sum
, last merge to df1
:
df23 = (pd.concat([df2, df3]) .groupby(['Location','Sample','Specimen'], as_index=False, sort=False) .sum(min_count=1)) df = df1.merge(df23, on=['Location','Sample']) print (df) Location Sample Description Specimen Result1 Result2 0 1 A Yellow x 5.0 4.0 1 1 A Yellow y 6.0 NaN 2 1 A Yellow q NaN 6.0 3 1 B Red x 10.0 NaN 4 1 B Red k NaN 8.0 5 2 A Blue x 1.0 NaN 6 2 B Violet z NaN 5.0
Or if all rows in df2,df3
per columns ['Location','Sample','Specimen']
are unique, solution is simplier:
df23 = pd.concat([df2.set_index(['Location','Sample','Specimen']), df3.set_index(['Location','Sample','Specimen'])], axis=1) df = df1.merge(df23.reset_index(), on=['Location','Sample']) print (df) Location Sample Description Specimen Result1 Result2 0 1 A Yellow q NaN 6.0 1 1 A Yellow x 5.0 4.0 2 1 A Yellow y 6.0 NaN 3 1 B Red k NaN 8.0 4 1 B Red x 10.0 NaN 5 2 A Blue x 1.0 NaN 6 2 B Violet z NaN 5.0
EDIT: With new data second solution working well:
df23 = pd.concat([df2.set_index(['Location','Sample','Specimen']), df3.set_index(['Location','Sample','Specimen'])], axis=1) df = df1.merge(df23.reset_index(), on=['Location','Sample']) print (df) Location Sample Description Specimen Result1 Result2 0 1 A Yellow q NaN Soft 1 1 A Yellow x 5.0 Heavy 2 1 A Yellow y 6.0 NaN 3 1 B Red k NaN Grey 4 1 B Red x 10.0 NaN 5 2 A Blue x 1.0 NaN 6 2 B Violet z NaN Bananas