I have a large data frame df
and a small data frame df_right
with 2 columns a
and b
. I want to do a simple left join / lookup on a
without copying df
.
I come up with this code but I am not sure how robust it is:
dtmp = pd.merge(df[['a']], df_right, on = 'a', how = "left") #one col left join df['b'] = dtmp['b'].values
I know it certainly fails when there are duplicated keys: pandas left join – why more results?
Is there better way to do this?
Related:
Outer merging two data frames in place in pandas
What are the exact downsides of copy=False in DataFrame.merge()?
Advertisement
Answer
You are almost there. There are 4 cases to consider:
- Both
df
anddf_right
do not have duplicated keys - Only
df
has duplicated keys - Only
df_right
has duplicated keys - Both
df
anddf_right
have duplicated keys
Your code fails in case 3 & 4 since the merging extends the number of row count in df
. In order to make it work, you need to choose what information to drop in df_right
prior to merging. The purpose of this is to enforce any merging scheme to be either case 1 or 2.
For example, if you wish to keep “first” values for each duplicated key in df_right
, the following code works for all 4 cases above.
dtmp = pd.merge(df[['a']], df_right.drop_duplicates('a', keep='first'), on='a', how='left') df['b'] = dtmp['b'].values
Alternatively, if column 'b'
of df_right
consists of numeric values and you wish to have summary statistic:
dtmp = pd.merge(df[['a']], df_right.groupby('a').mean().reset_index(drop=False), on='a', how='left') df['b'] = dtmp['b'].values