i have 2 dataframes have same columns with different len.
in [1] : df_g = pd.DataFrame([['EDC_TRAING_NO', 'EDU_T_N', 'NUMVER', '20'], ['EDC_TRAING_NAME', 'EDU_T_NM', 'VARCHAR', '40'], ['EDC_TRAING_ST', 'EDU_T_SD', 'DATETIME', '15'], ['EDC_TRAING_END', 'EDU_T_ED', 'DATETIME', '15'], ['EDC_PLACE_NM', 'EDU_P_NM', 'VARCHAR2', '40'], ['ONLINE_REQST_POSBL_AT', 'ONLINE_R_P_A', 'VARCHAR2', '1']], columns=['NAME', 'ID', 'TYPE', 'LEN']) in [2] : df_n = pd.DataFrame([['EDC_TRAING_NO', 'EDU_TR_N', 'NUMVER', '20'], ['EDC_TRAING_NAME', 'EDU_TR_NM', 'VARCHAR', '20'], ['EDC_TRAING_ST', 'EDU_TR_SD', 'DATETIME', '15'], ['EDC_TRAING_END', 'EDU_T_ED', 'DATETIME', '15'], ['EDC_PLACE_NM', 'EDU_PL_NM', 'VARCHAR2', '40'], ['ONLINE_REQST_POSBL_AT', 'ONLINE_REQ_P_A', 'VARCHAR2', '1']], columns=['NAME', 'ID', 'TYPE', 'LEN'])
the reuslt i want to get:
result = pd.DataFrame([['EDC_TRAING_NO', 'EDU_TR_N', 'NUMVER', '20'], ['EDC_TRAING_ST', 'EDU_TR_SD', 'DATETIME', '15'], ['EDC_TRAING_END', 'EDU_T_ED', 'DATETIME', '15'], ['EDC_PLACE_NM', 'EDU_PL_NM', 'VARCHAR2', '40'], ['ONLINE_REQST_POSBL_AT', 'ONLINE_REQ_P_A', 'VARCHAR2', '1']], columns=['NAME', 'ID', 'TYPE', 'LEN'])
and each df have length like this.
len(df_g) : 1000 len(df_n) : 5000
each dataframe has column named ‘name, id, type, len‘
i need to check those columns(name,type,len) in each df to compare ‘id‘ column whether it has same value or not.
so i tried like this.
for i in g.index: for j in n.index: g = g.iloc[i].values # make it to ndarray g_Str = g[0] + g[2] + g[3] # make it to str for pivot n = n.iloc[j].values n_Str = n[0] + str(n[2]) + str(n[3]) # comparing and check two df if g_Str == n_Str and g[1] != n[1]: print(i, j) print(g[0])
I have above code for 2 different length DataFrame. first i tried with ‘iterrows()’ for comparing those two df, but it took too much time.(very slow)
i looked up for other ways to make it work better performance.
possible ways i found
option1 transform df to dict with to_dict() / compare those in nested for-loop
option2 transform df.series to ndarray / compare those in nested for-loop
is there any other better option? or any option to not using nested for-loop?
thx.
Advertisement
Answer
you can try merge, and if you are looking for records where ids do mismatch then the following is one way of achieving it:
r1=df_g.merge(df_n,on=['NAME', 'TYPE', 'LEN'],how='inner').query('ID_x != ID_y').rename(columns={'ID_x': 'ID'}).drop('ID_y', 1)
I have used how=”inner” join, but based on need can use any of the following joins: {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’