Skip to content
Advertisement

Pandas left join in place

I have a large data frame df and a small data frame df_right with 2 columns a and b. I want to do a simple left join / lookup on a without copying df.

I come up with this code but I am not sure how robust it is:

dtmp = pd.merge(df[['a']], df_right, on = 'a', how = "left")  #one col left join
df['b'] = dtmp['b'].values

I know it certainly fails when there are duplicated keys: pandas left join – why more results?

Is there better way to do this?

Related:

Outer merging two data frames in place in pandas

What are the exact downsides of copy=False in DataFrame.merge()?

Advertisement

Answer

You are almost there. There are 4 cases to consider:

  1. Both df and df_right do not have duplicated keys
  2. Only df has duplicated keys
  3. Only df_right has duplicated keys
  4. Both df and df_right have duplicated keys

Your code fails in case 3 & 4 since the merging extends the number of row count in df. In order to make it work, you need to choose what information to drop in df_right prior to merging. The purpose of this is to enforce any merging scheme to be either case 1 or 2.

For example, if you wish to keep “first” values for each duplicated key in df_right, the following code works for all 4 cases above.

dtmp = pd.merge(df[['a']], df_right.drop_duplicates('a', keep='first'), on='a', how='left')
df['b'] = dtmp['b'].values

Alternatively, if column 'b' of df_right consists of numeric values and you wish to have summary statistic:

dtmp = pd.merge(df[['a']], df_right.groupby('a').mean().reset_index(drop=False), on='a', how='left')
df['b'] = dtmp['b'].values
User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement