Comparing two Dataframes with diff length to find difference in specific column

i have 2 dataframes have same columns with different len.

in [1] : df_g = pd.DataFrame([['EDC_TRAING_NO', 'EDU_T_N', 'NUMVER', '20'],
               ['EDC_TRAING_NAME', 'EDU_T_NM', 'VARCHAR', '40'],
               ['EDC_TRAING_ST', 'EDU_T_SD', 'DATETIME', '15'],
               ['EDC_TRAING_END', 'EDU_T_ED', 'DATETIME', '15'],
               ['EDC_PLACE_NM', 'EDU_P_NM', 'VARCHAR2', '40'],
               ['ONLINE_REQST_POSBL_AT', 'ONLINE_R_P_A', 'VARCHAR2', '1']],
              columns=['NAME', 'ID', 'TYPE', 'LEN'])

in [2] : df_n = pd.DataFrame([['EDC_TRAING_NO', 'EDU_TR_N', 'NUMVER', '20'],
               ['EDC_TRAING_NAME', 'EDU_TR_NM', 'VARCHAR', '20'],
               ['EDC_TRAING_ST', 'EDU_TR_SD', 'DATETIME', '15'],
               ['EDC_TRAING_END', 'EDU_T_ED', 'DATETIME', '15'],
               ['EDC_PLACE_NM', 'EDU_PL_NM', 'VARCHAR2', '40'],
               ['ONLINE_REQST_POSBL_AT', 'ONLINE_REQ_P_A', 'VARCHAR2', '1']],
              columns=['NAME', 'ID', 'TYPE', 'LEN'])

JavaScript
​x
 
in [1] : df_g = pd.DataFrame([['EDC_TRAING_NO', 'EDU_T_N', 'NUMVER', '20'],
               ['EDC_TRAING_NAME', 'EDU_T_NM', 'VARCHAR', '40'],
               ['EDC_TRAING_ST', 'EDU_T_SD', 'DATETIME', '15'],
               ['EDC_TRAING_END', 'EDU_T_ED', 'DATETIME', '15'],
               ['EDC_PLACE_NM', 'EDU_P_NM', 'VARCHAR2', '40'],
               ['ONLINE_REQST_POSBL_AT', 'ONLINE_R_P_A', 'VARCHAR2', '1']],
              columns=['NAME', 'ID', 'TYPE', 'LEN'])
​
in [2] : df_n = pd.DataFrame([['EDC_TRAING_NO', 'EDU_TR_N', 'NUMVER', '20'],
               ['EDC_TRAING_NAME', 'EDU_TR_NM', 'VARCHAR', '20'],
               ['EDC_TRAING_ST', 'EDU_TR_SD', 'DATETIME', '15'],
               ['EDC_TRAING_END', 'EDU_T_ED', 'DATETIME', '15'],
               ['EDC_PLACE_NM', 'EDU_PL_NM', 'VARCHAR2', '40'],
               ['ONLINE_REQST_POSBL_AT', 'ONLINE_REQ_P_A', 'VARCHAR2', '1']],
              columns=['NAME', 'ID', 'TYPE', 'LEN'])
​

the reuslt i want to get:

result = pd.DataFrame([['EDC_TRAING_NO', 'EDU_TR_N', 'NUMVER', '20'],
                   ['EDC_TRAING_ST', 'EDU_TR_SD', 'DATETIME', '15'],
                   ['EDC_TRAING_END', 'EDU_T_ED', 'DATETIME', '15'],
                   ['EDC_PLACE_NM', 'EDU_PL_NM', 'VARCHAR2', '40'],
                   ['ONLINE_REQST_POSBL_AT', 'ONLINE_REQ_P_A', 'VARCHAR2', '1']],
                  columns=['NAME', 'ID', 'TYPE', 'LEN'])

JavaScript
 
result = pd.DataFrame([['EDC_TRAING_NO', 'EDU_TR_N', 'NUMVER', '20'],
                   ['EDC_TRAING_ST', 'EDU_TR_SD', 'DATETIME', '15'],
                   ['EDC_TRAING_END', 'EDU_T_ED', 'DATETIME', '15'],
                   ['EDC_PLACE_NM', 'EDU_PL_NM', 'VARCHAR2', '40'],
                   ['ONLINE_REQST_POSBL_AT', 'ONLINE_REQ_P_A', 'VARCHAR2', '1']],
                  columns=['NAME', 'ID', 'TYPE', 'LEN'])
​

and each df have length like this.

len(df_g) : 1000
len(df_n) : 5000

JavaScript
 
len(df_g) : 1000
len(df_n) : 5000
​

each dataframe has column named ‘name, id, type, len‘

i need to check those columns(name,type,len) in each df to compare ‘id‘ column whether it has same value or not.

so i tried like this.

for i in g.index:
    for j in n.index:
        g = g.iloc[i].values
        # make it to ndarray
        g_Str = g[0] + g[2] + g[3]
        # make it to str for pivot
        n = n.iloc[j].values
        n_Str = n[0] + str(n[2]) + str(n[3])
        # comparing and check two df
        if g_Str == n_Str and g[1] != n[1]:
            print(i, j)
            print(g[0])

JavaScript
 
for i in g.index:
    for j in n.index:
        g = g.iloc[i].values
        # make it to ndarray
        g_Str = g[0] + g[2] + g[3]
        # make it to str for pivot
        n = n.iloc[j].values
        n_Str = n[0] + str(n[2]) + str(n[3])
        # comparing and check two df
        if g_Str == n_Str and g[1] != n[1]:
            print(i, j)
            print(g[0])
​

I have above code for 2 different length DataFrame. first i tried with ‘iterrows()’ for comparing those two df, but it took too much time.(very slow)

i looked up for other ways to make it work better performance.

possible ways i found

option1 transform df to dict with to_dict() / compare those in nested for-loop
option2 transform df.series to ndarray / compare those in nested for-loop

is there any other better option? or any option to not using nested for-loop?

thx.

Answer

you can try merge, and if you are looking for records where ids do mismatch then the following is one way of achieving it:

r1=df_g.merge(df_n,on=['NAME', 'TYPE', 'LEN'],how='inner').query('ID_x != ID_y').rename(columns={'ID_x': 'ID'}).drop('ID_y', 1)

JavaScript
 
r1=df_g.merge(df_n,on=['NAME', 'TYPE', 'LEN'],how='inner').query('ID_x != ID_y').rename(columns={'ID_x': 'ID'}).drop('ID_y', 1)
​

I have used how=”inner” join, but based on need can use any of the following joins: {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’

Advertisement

Answer