Pandas `hash_pandas_object` not producing duplicate hash values for duplicate entries

I have two dataframes, df1 and df2, and I know that df2 is a subset of df1. What I am trying to do is find the set difference between df1 and df2, such that df1 has only entries that are different from those in df2. To accomplish this, I first used pandas.util.hash_pandas_object on each of the dataframes, and then found the set difference between the two hashed columns.

df1['hash'] = pd.util.hash_pandas_object(df1, index=False)
df2['hash'] = pd.util.hash_pandas_object(df2, index=False)
df1 = df1.loc[~df1['hash'].isin(df2['hash'])]

This results in df1 remaining the same size; that is, none of the hash values matched. However, when I use a lambda function, df1 is reduced by the expected amount.

df1['hash'] = df1.apply(lambda x: hash(tuple(x)), axis=1)
df2['hash'] = df2.apply(lambda x: hash(tuple(x)), axis=1)
df1 = df1.loc[~df1['hash'].isin(df2['hash'])]

The problem with the second approach is that it takes an extremely long time to execute (df1 has about 3 million rows). Am I just misunderstanding how to use pandas.util.hash_pandas_object?


Answer

The difference is not quite that one approach hashes rows and the other doesn't: `pd.util.hash_pandas_object` also returns one hash per row. But its hashes are computed from the raw column values and are dtype-sensitive, so rows that print identically in df1 and df2 can hash differently if the column dtypes differ between the two frames. The `hash(tuple(x))` version compares the Python values of each row instead, which is why it finds the matches, at the cost of a slow Python-level loop.
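As a hedged illustration of one way the hash comparison can fail (whether this is the cause for your data depends on your dtypes): `hash_pandas_object` hashes the underlying values per dtype, so an `int64` column and a `float64` column holding the "same" numbers hash differently, while Python's built-in `hash()` treats `1` and `1.0` as equal:

```python
import pandas as pd

# Identical-looking values, different dtypes.
ints = pd.Series([1, 2, 3], dtype="int64")
floats = pd.Series([1.0, 2.0, 3.0], dtype="float64")

h_ints = pd.util.hash_pandas_object(ints, index=False)
h_floats = pd.util.hash_pandas_object(floats, index=False)

# hash_pandas_object: no per-row matches across the two dtypes.
print((h_ints.values == h_floats.values).any())

# Python's hash(): 1 and 1.0 hash equal, so tuple-hashing still matches.
print([hash(i) == hash(f) for i, f in zip(ints, floats)])  # [True, True, True]
```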

If your objective is to remove the rows of df1 that also appear in df2, a faster approach is a left merge with the indicator option (an anti-join): merge, then keep only the rows that came from df1 alone.

df_merged = df1.merge(df2, how='left', on=list_columns, indicator=True)  # list_columns: the columns to compare on
df_merged = df_merged[df_merged['_merge'] == 'left_only']  # keep only rows with no match in df2
df_merged = df_merged.drop(columns='_merge')  # drop the helper column added by indicator=True
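A self-contained sketch of the anti-join on toy data (here `list_columns` stands in for whatever shared columns you compare on; note the indicator column is named `_merge` by default):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["w", "x", "y", "z"]})
df2 = df1.iloc[[1, 3]]            # df2 is a subset of df1
list_columns = list(df1.columns)  # compare on every column

# The left merge tags each df1 row: 'both' if it also exists in df2,
# 'left_only' if it does not.
merged = df1.merge(df2, how="left", on=list_columns, indicator=True)
result = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(result)  # rows (1, 'w') and (3, 'y') survive
```

If df2 can contain duplicate rows, apply `df2.drop_duplicates()` first, since a left merge multiplies each matching row.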