I have two dataframes, df1 and df2, and I know that df2 is a subset of df1. What I am trying to do is find the set difference between df1 and df2, such that df1 retains only the entries that do not appear in df2. To accomplish this, I first used pandas.util.hash_pandas_object on each of the dataframes, and then found the set difference between the two hashed columns.
df1['hash'] = pd.util.hash_pandas_object(df1, index=False)
df2['hash'] = pd.util.hash_pandas_object(df2, index=False)
df1 = df1.loc[~df1['hash'].isin(df2['hash'])]
This results in df1 remaining the same size; that is, none of the hash values matched. However, when I use a lambda function, df1 is reduced by the expected amount.
df1['hash'] = df1.apply(lambda x: hash(tuple(x)), axis=1)
df2['hash'] = df2.apply(lambda x: hash(tuple(x)), axis=1)
df1 = df1.loc[~df1['hash'].isin(df2['hash'])]
The problem with the second approach is that it takes an extremely long time to execute (df1 has about 3 million rows). Am I just misunderstanding how to use pandas.util.hash_pandas_object?
Answer
The difference is that in the first case you are hashing the complete dataframe, while in the second case you are hashing each individual row.
If your objective is to remove the duplicate rows, you can achieve this faster using a left merge with the indicator option, and then dropping the rows that are not unique to the original dataframe.
# list_columns is the list of column names to match on
df_merged = df1.merge(df2, how='left', on=list_columns, indicator=True)
# indicator=True adds a column named '_merge'; 'left_only' marks unmatched rows
df_merged = df_merged[df_merged['_merge'] == 'left_only']
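Here is a minimal, self-contained sketch of this anti-join on toy data (the column names a and b are illustrative, standing in for whatever columns define row identity in your frames):

```python
import pandas as pd

# Toy data: df2 is a subset of df1, as in the question.
df1 = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["w", "x", "y", "z"]})
df2 = pd.DataFrame({"a": [2, 4], "b": ["x", "z"]})

list_columns = ["a", "b"]  # the columns that define row identity

# A left merge with indicator=True adds a '_merge' column recording,
# for each row of df1, whether it also appeared in df2.
df_merged = df1.merge(df2, how="left", on=list_columns, indicator=True)

# Keep only the rows unique to df1, then drop the helper column.
result = df_merged[df_merged["_merge"] == "left_only"].drop(columns="_merge")
print(result)  # the rows (1, "w") and (3, "y")
```

This avoids hashing entirely: the merge matches rows by value across all listed columns, which is vectorized and scales well to millions of rows.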