Skip to content
Advertisement

merging on pandas: reduce the set of merging variables when match is not possible

Using python, I want to merge on multiple variables; A, B, C, but when realization a-b-c in one dataset is missing, use the finer combination that the observation has (like b-c).

Example:

Suppose I have a dataset (df1) containing person’s characteristics (gender, married, city). And another dataset (df2) that I have the median income of a person according to their gender, city, married (created with a groupby).

Then I want to input that median income into the first dataset (df1) matching in as many characterisics as possible. That is if individual has characteristics gender-city-married that has median income, use that value. If the individual has characteristics that there is only city-married median income, to use that value.

Something like that

df1 = pd.DataFrame({'Male':['0', '0', '1','1'],'Married':['0', '1', '0','1'], 'City': ['NY', 'NY', 'NY', 'NY']})
  Male Married City
    0       0   NY
    0       1   NY
    1       0   NY
    1       1   NY

df2 = pd.DataFrame({'Male':['0', '0', '1'],'Married':['0', '1', '1'], 'City': ['NY', 'NY','NY'], 'income':['300','400', '500']})

  Male Married City income
    0       0   NY    300
    0       1   NY    400
    1       1   NY    500
'''
and the desired outcome:
'''
desired_df1:
  Male Married City income
   0      0    NY    300
   0      1    NY    400
   1      0    NY    300
   1      1    NY    400

I was thinking to do a 1st merge by=['male','married','city'], and then fill missing values from a 2nd merge by=['married','city']. But I think there should be a more systematic and simpler way. Any suggestions?

Thanks and sorry if formulation is not correct or it is duplicate (I look deeply and didn’t find anything).

Advertisement

Answer

You can do a groupby and fillna too after merging:

out = df1.merge(df2,on=['Male','Married','City'],how='left')
out['income'] = (out['income'].fillna(out.groupby(['Married','City'])['income']
                                                     .fillna(method='ffill')))

print(out)

  Male Married City income
0    0       0   NY    300
1    0       1   NY    400
2    1       0   NY    300
3    1       1   NY    500 # <- Note that this should be 500 not 400
User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement