I have a similar question as here Comparing two data frames and getting the differences But columns in df1 is a subset of columns in df2. I would like to get the difference the two df by comparing those columns in common only. So the result I expect to get is Is there a way to do so? Any help

Comparing two data frames with different columns and getting the differences

I have a similar question as here Comparing two data frames and getting the differences But columns in df1 is a subset of columns in df2.

df1:
Date       Fruit  Num  Color 
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange  8.6 Orange
2013-11-24 Apple   7.6 Green
2013-11-24 Celery 10.2 Green

df2:
Date       Fruit  Num  Color  A
2013-11-24 Banana 22.1 Yellow 1 
2013-11-24 Orange  8.6 Orange 2
2013-11-24 Apple   7.6 Green  3 
2013-11-24 Celery 10.2 Green  4
2013-11-25 Apple  22.1 Red    5
2013-11-25 Orange  8.6 Orange 6

JavaScript
​x
 
df1:
Date       Fruit  Num  Color 
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange  8.6 Orange
2013-11-24 Apple   7.6 Green
2013-11-24 Celery 10.2 Green
​
df2:
Date       Fruit  Num  Color  A
2013-11-24 Banana 22.1 Yellow 1 
2013-11-24 Orange  8.6 Orange 2
2013-11-24 Apple   7.6 Green  3 
2013-11-24 Celery 10.2 Green  4
2013-11-25 Apple  22.1 Red    5
2013-11-25 Orange  8.6 Orange 6
​

I would like to get the difference the two df by comparing those columns in common only. So the result I expect to get is

         Date   Fruit   Num   Color A
4  2013-11-25   Apple  22.1     Red 5
5  2013-11-25  Orange   8.6  Orange 6

JavaScript
 
         Date   Fruit   Num   Color A
4  2013-11-25   Apple  22.1     Red 5
5  2013-11-25  Orange   8.6  Orange 6
​

Is there a way to do so? Any help is appreciated.

Answer

First you get the column names of df1

df1_columns = df1.columns # ["Date", "Fruit", "Num", "Color"]

JavaScript
 
df1_columns = df1.columns # ["Date", "Fruit", "Num", "Color"]
​

Now you create a new df2 dataframe with only df1 columns

df2_filtered = df2[df1_columns]

JavaScript
 
df2_filtered = df2[df1_columns]
​

And now you can apply the solution from this other question.

#concatenate both dataframes
df = pd.concat([df1, df2_filtered])
df = df.reset_index(drop=True)

#group by
df_gpby = df.groupby(list(df.columns))

# get index of unique records
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

#filter
df.reindex(idx)

JavaScript
 
#concatenate both dataframes
df = pd.concat([df1, df2_filtered])
df = df.reset_index(drop=True)
​
#group by
df_gpby = df.groupby(list(df.columns))
​
# get index of unique records
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
​
#filter
df.reindex(idx)
​

Hope it helps!

Advertisement

Answer