I have two dataframes like as below
ID,Name,Sub,Country 1,ABC,ENG,UK 1,ABC,MATHS,UK 1,ABC,Science,UK 2,ABE,ENG,USA 2,ABE,MATHS,USA 2,ABE,Science,USA 3,ABF,ENG,IND 3,ABF,MATHS,IND 3,ABF,Science,IND df1 = pd.read_clipboard(sep=',') ID,Name,class,age 11,ABC,ENG,21 12,ABC,MATHS,23 1,ABC,Science,25 22,ABE,ENG,19 23,ABE,MATHS,22 24,ABE,Science,26 33,ABF,ENG,24 31,ABF,MATHS,28 32,ABF,Science,26 df2 = pd.read_clipboard(sep=',')
I would like to do the below
a) Check whether the ID and Name from df1 is present in df2.
b) If present in df2, put Yes in Status column or No in Status column. Don’t use ~ or not in operator because my df2 has million of rows. So, it will result in irrelevant results
I tried the below
ID_list = df1['ID'].unique().tolist() Name_list = df1['Name'].unique().tolist() filtered_df = df2[((df2['ID'].isin(ID_list)) & (df2['Name'].isin(Name_list)))] filtered_df = filtered_df.groupby(['ID','Name','Sub']).size().reset_index()
The above code gives matching ids and names between df1 and df2.
But I want to find the ids and names that are present in df1 but missing/absent in df2. I cannot use the ~ operator because it will return all the rows from df2 that don’t have a match in df1. In real world, my df2 has millions of rows. I only want to find the missing df1 ids and names and put a status column
I expect my output to be like as below
ID,Name,Sub,Country, Status 1,ABC,ENG,UK,No 1,ABC,MATHS,UK,No 1,ABC,Science,UK,Yes 2,ABE,ENG,USA,No 2,ABE,MATHS,USA,No 2,ABE,Science,USA,No 3,ABF,ENG,IND,No 3,ABF,MATHS,IND,No 3,ABF,Science,IND,No
Advertisement
Answer
Expected ouput is for match by 3 columns:
m = df1.merge(df2, 
              left_on=['ID','Name','Sub'],
              right_on=['ID','Name','class'], 
              indicator=True, how='left')['_merge'].eq('both')
df1['Status'] = np.where(m, 'Yes', 'No')
print (df1)
   ID Name      Sub Country Status
0   1  ABC      ENG      UK     No
1   1  ABC    MATHS      UK     No
2   1  ABC  Science      UK    Yes
3   2  ABE      ENG     USA     No
4   2  ABE    MATHS     USA     No
5   2  ABE  Science     USA     No
6   3  ABF      ENG     IND     No
7   3  ABF    MATHS     IND     No
8   3  ABF  Science     IND     No
With testing by isin solution is:
idx1 = pd.MultiIndex.from_frame(df1[['ID','Name','Sub']]) idx2 = pd.MultiIndex.from_frame(df2[['ID','Name','class']]) df1['Status'] = np.where(idx1.isin(idx2), 'Yes', 'No') print (df1) ID Name Sub Country Status 0 1 ABC ENG UK No 1 1 ABC MATHS UK No 2 1 ABC Science UK Yes 3 2 ABE ENG USA No 4 2 ABE MATHS USA No 5 2 ABE Science USA No 6 3 ABF ENG IND No 7 3 ABF MATHS IND No 8 3 ABF Science IND No
Because if match by 2 columns ouput is different:
m = df1.merge(df2, on=['ID','Name'], indicator=True, how='left')['_merge'].eq('both')
df1['Status'] = np.where(m, 'Yes', 'No')
print (df1)
   ID Name      Sub Country Status
0   1  ABC      ENG      UK    Yes
1   1  ABC    MATHS      UK    Yes
2   1  ABC  Science      UK    Yes
3   2  ABE      ENG     USA     No
4   2  ABE    MATHS     USA     No
5   2  ABE  Science     USA     No
6   3  ABF      ENG     IND     No
7   3  ABF    MATHS     IND     No
8   3  ABF  Science     IND     No
