I have two dataframes like as below
ID,Name,Sub,Country 1,ABC,ENG,UK 1,ABC,MATHS,UK 1,ABC,Science,UK 2,ABE,ENG,USA 2,ABE,MATHS,USA 2,ABE,Science,USA 3,ABF,ENG,IND 3,ABF,MATHS,IND 3,ABF,Science,IND df1 = pd.read_clipboard(sep=',') ID,Name,class,age 11,ABC,ENG,21 12,ABC,MATHS,23 1,ABC,Science,25 22,ABE,ENG,19 23,ABE,MATHS,22 24,ABE,Science,26 33,ABF,ENG,24 31,ABF,MATHS,28 32,ABF,Science,26 df2 = pd.read_clipboard(sep=',')
I would like to do the below
a) Check whether the ID
and Name
from df1
is present in df2
.
b) If present in df2
, put Yes
in Status column or No
in Status column. Don’t use ~
or not in
operator because my df2 has million of rows. So, it will result in irrelevant results
I tried the below
ID_list = df1['ID'].unique().tolist() Name_list = df1['Name'].unique().tolist() filtered_df = df2[((df2['ID'].isin(ID_list)) & (df2['Name'].isin(Name_list)))] filtered_df = filtered_df.groupby(['ID','Name','Sub']).size().reset_index()
The above code gives matching ids and names between df1
and df2
.
But I want to find the ids
and names
that are present in df1
but missing/absent in df2
. I cannot use the ~
operator because it will return all the rows from df2
that don’t have a match in df1
. In real world, my df2 has millions of rows. I only want to find the missing df1 ids and names and put a status column
I expect my output to be like as below
ID,Name,Sub,Country, Status 1,ABC,ENG,UK,No 1,ABC,MATHS,UK,No 1,ABC,Science,UK,Yes 2,ABE,ENG,USA,No 2,ABE,MATHS,USA,No 2,ABE,Science,USA,No 3,ABF,ENG,IND,No 3,ABF,MATHS,IND,No 3,ABF,Science,IND,No
Advertisement
Answer
Expected ouput is for match by 3 columns:
m = df1.merge(df2, left_on=['ID','Name','Sub'], right_on=['ID','Name','class'], indicator=True, how='left')['_merge'].eq('both') df1['Status'] = np.where(m, 'Yes', 'No') print (df1) ID Name Sub Country Status 0 1 ABC ENG UK No 1 1 ABC MATHS UK No 2 1 ABC Science UK Yes 3 2 ABE ENG USA No 4 2 ABE MATHS USA No 5 2 ABE Science USA No 6 3 ABF ENG IND No 7 3 ABF MATHS IND No 8 3 ABF Science IND No
With testing by isin
solution is:
idx1 = pd.MultiIndex.from_frame(df1[['ID','Name','Sub']]) idx2 = pd.MultiIndex.from_frame(df2[['ID','Name','class']]) df1['Status'] = np.where(idx1.isin(idx2), 'Yes', 'No') print (df1) ID Name Sub Country Status 0 1 ABC ENG UK No 1 1 ABC MATHS UK No 2 1 ABC Science UK Yes 3 2 ABE ENG USA No 4 2 ABE MATHS USA No 5 2 ABE Science USA No 6 3 ABF ENG IND No 7 3 ABF MATHS IND No 8 3 ABF Science IND No
Because if match by 2 columns ouput is different:
m = df1.merge(df2, on=['ID','Name'], indicator=True, how='left')['_merge'].eq('both') df1['Status'] = np.where(m, 'Yes', 'No') print (df1) ID Name Sub Country Status 0 1 ABC ENG UK Yes 1 1 ABC MATHS UK Yes 2 1 ABC Science UK Yes 3 2 ABE ENG USA No 4 2 ABE MATHS USA No 5 2 ABE Science USA No 6 3 ABF ENG IND No 7 3 ABF MATHS IND No 8 3 ABF Science IND No