Skip to content
Advertisement

Pandas merging/joining tables with multiple key columns and duplicating rows where necessary

I have several tables that contain lab results, with a ‘master’ table of sample data with things like a description. The results tables are also broken down by specimen (sub-samples). They contain multiple results columns – I’m just showing one here. I want to combine all the results tables into one dataframe, like this:

Table 1:
Location    Sample   Description
1           A        Yellow
1           B        Red
2           A        Blue
2           B        Violet

Table 2
Location  Sample    Specimen   Result1
1         A         X          5
1         A         Y          6
1         B         X          10
2         A         X          1

Table 3
Location  Sample    Specimen   Result2
1         A         X          "Heavy"
1         A         Q          "Soft"
1         B         K          "Grey"
2         B         Z          "Bananas"

Desired Output:
Location  Sample  Description  Specimen  Result1  Result2
1         A       Yellow       X         5        "Heavy"
1         A       Yellow       Y         6        nan
1         A       Yellow       Q         nan      "Soft"
1         B       Red          X         10       nan
1         B       Red          K         nan      "Grey"
2         A       Blue         X         1        nan
2         B       Violet       Z         nan      "Bananas"      

I currently have a solution for this using iterrows() and df.append(), but these are both slow operations and when there are thousands of results it takes too long. Is there better way? I have tried using join() and merge() but I can’t seem to get the result I want.

Quick code to reproduce my dataframes:

dict1 = {'Location': [1,1,2,2], 'Sample': ['A','B','A','B'], 'Description': ['Yellow','Red','Blue','Violet']}
dict2 = {'Location': [1,1,1,2], 'Sample': ['A','A','B','A'], 'Specimen': ['x', 'y','x', 'x'], 'Result1': [5,6,10,1]}
dict3 = {'Location': [1,1,1,2], 'Sample': ['A','A','B','B'], 'Specimen': ['x', 'q','k', 'z'], 'Result2': ["Heavy","Soft","Grey","Bananas"]}

df1 = pd.DataFrame.from_dict(dict1)
df2 = pd.DataFrame.from_dict(dict2)
df3 = pd.DataFrame.from_dict(dict3)

Advertisement

Answer

First idea is join df2, df3 together by concat and for unique 'Location','Sample','Specimen' rows are rows aggregated by sum, last merge to df1:

df23 = (pd.concat([df2, df3])
          .groupby(['Location','Sample','Specimen'], as_index=False, sort=False)
          .sum(min_count=1))
df = df1.merge(df23, on=['Location','Sample'])
print (df)
   Location Sample Description Specimen  Result1  Result2
0         1      A      Yellow        x      5.0      4.0
1         1      A      Yellow        y      6.0      NaN
2         1      A      Yellow        q      NaN      6.0
3         1      B         Red        x     10.0      NaN
4         1      B         Red        k      NaN      8.0
5         2      A        Blue        x      1.0      NaN
6         2      B      Violet        z      NaN      5.0

Or if all rows in df2,df3 per columns ['Location','Sample','Specimen'] are unique, solution is simplier:

df23 = pd.concat([df2.set_index(['Location','Sample','Specimen']), 
                  df3.set_index(['Location','Sample','Specimen'])], axis=1)

df = df1.merge(df23.reset_index(), on=['Location','Sample'])
print (df)
   Location Sample Description Specimen  Result1  Result2
0         1      A      Yellow        q      NaN      6.0
1         1      A      Yellow        x      5.0      4.0
2         1      A      Yellow        y      6.0      NaN
3         1      B         Red        k      NaN      8.0
4         1      B         Red        x     10.0      NaN
5         2      A        Blue        x      1.0      NaN
6         2      B      Violet        z      NaN      5.0

EDIT: With new data second solution working well:

df23 = pd.concat([df2.set_index(['Location','Sample','Specimen']), 
                  df3.set_index(['Location','Sample','Specimen'])], axis=1)

df = df1.merge(df23.reset_index(), on=['Location','Sample'])
print (df)        
   Location Sample Description Specimen  Result1  Result2
0         1      A      Yellow        q      NaN     Soft
1         1      A      Yellow        x      5.0    Heavy
2         1      A      Yellow        y      6.0      NaN
3         1      B         Red        k      NaN     Grey
4         1      B         Red        x     10.0      NaN
5         2      A        Blue        x      1.0      NaN
6         2      B      Violet        z      NaN  Bananas
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement