Skip to content
Advertisement

Check if values in one dataframe match values from another, updating dataframe

Let’s say I have 2 dataframes, both have different lengths but the same amount of columns

df1 = pd.DataFrame({'country': ['Russia','Mexico','USA','Argentina','Denmark','Syngapore'], 
'population': [41,12,26,64,123,24]})

df2 = pd.DataFrame({'country': ['Russia','Argentina','Australia','USA'], 
'population': [44,12,23,64]})

Lets assume that some of the data in df1 is outdated and I’ve received a new dataframe that contains some new data but not which may or may not exist already in the outdated dataframe.

I want to find out if any of the values of df2.country are inside df1.country

By doing the following I’m able to return a boolean:

df = df1.country.isin(df2.country)

print(df)

Unfortunately I’m just creating a new dataframe containing the answer to my question

0     True
1    False
2     True
3     True
4    False
5    False
Name: country, dtype: bool

My goal here is to delete the rows of df1 which values match with df2 and add the new data, kind of like an update.

I’ve manage to come up with something like this:

df = df1.country.isin(df2.country)
i = 0
for x in df:
    if x:
        df1.drop(i, inplace=True)
    i += 1
frames = [df1, df2]
df1 = pd.concat(frames)
df1.reset_index(drop=True, inplace=True)
print(df1)

which in fact works and updates the dataframe

     country  population
0     Mexico          12
1    Denmark         123
2  Syngapore          24
3     Russia          44
4  Argentina          12
5  Australia          23
6        USA          64

But I really believe there’s a batter way of doing the same thing quicker and much more practical considering that the real dataframe is much bigger and updates every few seconds.

I’d love to hear some suggestions, Thanks!

Advertisement

Answer

The isin approach is so close! Simply use the results from isin as a mask, then concat the rows from df1 that are not in (~) df2 with the rest of df2:

m = df1['country'].isin(df2['country'])
df3 = pd.concat((df1[~m], df2), ignore_index=True)

df3:

     country  population
0     Mexico          12
1    Denmark         123
2  Syngapore          24
3     Russia          44
4  Argentina          12
5  Australia          23
6        USA          64
User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement