Pandas merge indexing not behaving as expected

I am trying to perform an anti-join in effectively one line. However, my one line solution is not giving me the same results that a receive when breaking up the code into two lines (which behaves as expected). Specifically, the single-line solution results in a dataframe with fewer rows.

The goal of my anti-join is to remove any overlap of the UNIQUE_IDs of df and df_subset1. The result will be df_subset2.

Single line:

df_subset2 = df.loc[df.merge(df_subset1['UNIQUE_ID'], on = 'UNIQUE_ID', how = 'outer', indicator = True)._merge == 'left_only']

Multi line:

df_subset2 = df.merge(df_subset1['UNIQUE_ID'], on = 'UNIQUE_ID', how = 'outer', indicator = True)
df_subset2 = df_subset2.loc[df_subset2._merge == 'left_only']

These yield different results, but from my perspective the single line should still work as intended. Does merging change the index of the underlying dataframe? That would be the only reason I can think of which would cause the .loc operation to select differently.

Answer

To get one-line solution, you can try chaining your df.merge() call with a .query() call like below:

df_subset2 = df.merge(df_subset1['UNIQUE_ID'], on = 'UNIQUE_ID', how = 'outer', indicator = True).query('_merge == "left_only"')

Advertisement

Answer