I would like to achieve what it’s described here: stackoverflow question, but only using standard pandas.
I have two dataframes: Fist
first_employee target_employee relationship 0 Andy Claude 0 1 Andy Frida 20 2 Andy Georgia -10 3 Andy Joan 30 4 Andy Lee -10 5 Andy Pablo -10 6 Andy Vincent 20 7 Claude Frida 0 8 Claude Georgia 90 9 Claude Joan 0 10 Claude Lee 0 11 Claude Pablo 10 12 Claude Vincent 0 13 Frida Georgia 0 14 Frida Joan 0 15 Frida Lee 0 16 Frida Pablo 50 17 Frida Vincent 60 18 Georgia Joan 0 19 Georgia Lee 10 20 Georgia Pablo 0 21 Georgia Vincent 0 22 Joan Lee 70 23 Joan Pablo 0 24 Joan Vincent 10 25 Lee Pablo 0 26 Lee Vincent 0 27 Pablo Vincent -20
Second:
first_employee target_employee book_count 0 Vincent Frida 2 1 Vincent Pablo 1 2 Andy Claude 1 3 Andy Joan 1 4 Andy Pablo 1 5 Andy Lee 1 6 Andy Frida 1 7 Andy Georgia 1 8 Claude Georgia 3 9 Joan Lee 3 10 Pablo Frida 2
I want to join the two dataframes such that my final dataframe is identical to the first one, but it has also the book_count
column with the corresponding values (and NaN if not available).
I already wrote something like: joined_df = first_df.merge(second_df, on = ['first_employee', 'target_employee'], how = 'outer')
and I get:
first_employee target_employee relationship book_count 0 Andy Claude 0.0 1.0 1 Andy Frida 20.0 1.0 2 Andy Georgia -10.0 1.0 3 Andy Joan 30.0 1.0 4 Andy Lee -10.0 1.0 5 Andy Pablo -10.0 1.0 6 Andy Vincent 20.0 NaN 7 Claude Frida 0.0 NaN 8 Claude Georgia 90.0 3.0 9 Claude Joan 0.0 NaN 10 Claude Lee 0.0 NaN 11 Claude Pablo 10.0 NaN 12 Claude Vincent 0.0 NaN 13 Frida Georgia 0.0 NaN 14 Frida Joan 0.0 NaN 15 Frida Lee 0.0 NaN 16 Frida Pablo 50.0 NaN 17 Frida Vincent 60.0 NaN 18 Georgia Joan 0.0 NaN 19 Georgia Lee 10.0 NaN 20 Georgia Pablo 0.0 NaN 21 Georgia Vincent 0.0 NaN 22 Joan Lee 70.0 3.0 23 Joan Pablo 0.0 NaN 24 Joan Vincent 10.0 NaN 25 Lee Pablo 0.0 NaN 26 Lee Vincent 0.0 NaN 27 Pablo Vincent -20.0 NaN 28 Vincent Frida NaN 2.0 29 Vincent Pablo NaN 1.0 30 Pablo Frida NaN 2.0
And it is somewhat close to what I would like to achieve. However, the ordering of the values in the first_employee
and target_employee
it’s not relevant, so if in the first dataframe I have (Frida,Vincent)
and in the second (Vincent, Frida)
, these twos should be merged together (what matters are the values, not the column-wise order).
In my resulting dataframe i get three extra rows:
28 Vincent Frida NaN 2.0 29 Vincent Pablo NaN 1.0 30 Pablo Frida NaN 2.0
which are the result of my merging that considers “ordered” values columns-wise to make the join: these 3 extra rows should be merged on the already available couples (Frida, Vincent)
(Pablo, Vincent)
and (Frida, Pablo)
.
Is there a way to do so using only standard pandas
functions? (the question I cited at the beginning uses sqldf
)
Advertisement
Answer
I believe this is what you are looking for. Using np.sort
will change the order of the first two columns for each row so it is alphabetical, allowing the merge to work correctly.
cols = ['first_employee','target_employee'] df[cols] = np.sort(df[cols].to_numpy(),axis=1) df2[cols] = np.sort(df2[cols].to_numpy(),axis=1) ndf = pd.merge(df,df2,on = cols,how='left')