Let’s say I have 2 dataframes, both have different lengths but the same amount of columns
df1 = pd.DataFrame({'country': ['Russia','Mexico','USA','Argentina','Denmark','Syngapore'], 'population': [41,12,26,64,123,24]}) df2 = pd.DataFrame({'country': ['Russia','Argentina','Australia','USA'], 'population': [44,12,23,64]})
Lets assume that some of the data in df1
is outdated and I’ve received a new dataframe that contains some new data but not which may or may not exist already in the outdated dataframe.
I want to find out if any of the values of df2.country
are inside df1.country
By doing the following I’m able to return a boolean:
df = df1.country.isin(df2.country) print(df)
Unfortunately I’m just creating a new dataframe containing the answer to my question
0 True 1 False 2 True 3 True 4 False 5 False Name: country, dtype: bool
My goal here is to delete the rows of df1
which values match with df2
and add the new data, kind of like an update.
I’ve manage to come up with something like this:
df = df1.country.isin(df2.country) i = 0 for x in df: if x: df1.drop(i, inplace=True) i += 1 frames = [df1, df2] df1 = pd.concat(frames) df1.reset_index(drop=True, inplace=True) print(df1)
which in fact works and updates the dataframe
country population 0 Mexico 12 1 Denmark 123 2 Syngapore 24 3 Russia 44 4 Argentina 12 5 Australia 23 6 USA 64
But I really believe there’s a batter way of doing the same thing quicker and much more practical considering that the real dataframe is much bigger and updates every few seconds.
I’d love to hear some suggestions, Thanks!
Advertisement
Answer
The isin
approach is so close! Simply use the results from isin
as a mask, then concat
the rows from df1
that are not in (~
) df2
with the rest of df2
:
m = df1['country'].isin(df2['country']) df3 = pd.concat((df1[~m], df2), ignore_index=True)
df3
:
country population 0 Mexico 12 1 Denmark 123 2 Syngapore 24 3 Russia 44 4 Argentina 12 5 Australia 23 6 USA 64