pandas join tables on two columns without ordering of values

I would like to achieve what it’s described here: stackoverflow question, but only using standard pandas.

I have two dataframes: Fist

  first_employee target_employee  relationship
0            Andy          Claude             0
1            Andy           Frida            20
2            Andy         Georgia           -10
3            Andy            Joan            30
4            Andy             Lee           -10
5            Andy           Pablo           -10
6            Andy         Vincent            20
7          Claude           Frida             0
8          Claude         Georgia            90
9          Claude            Joan             0
10         Claude             Lee             0
11         Claude           Pablo            10
12         Claude         Vincent             0
13          Frida         Georgia             0
14          Frida            Joan             0
15          Frida             Lee             0
16          Frida           Pablo            50
17          Frida         Vincent            60
18        Georgia            Joan             0
19        Georgia             Lee            10
20        Georgia           Pablo             0
21        Georgia         Vincent             0
22           Joan             Lee            70
23           Joan           Pablo             0
24           Joan         Vincent            10
25            Lee           Pablo             0
26            Lee         Vincent             0
27          Pablo         Vincent           -20

JavaScript
​x
 
  first_employee target_employee  relationship
          Andy          Claude             0
          Andy           Frida            20
          Andy         Georgia           -10
          Andy            Joan            30
          Andy             Lee           -10
          Andy           Pablo           -10
          Andy         Vincent            20
        Claude           Frida             0
        Claude         Georgia            90
        Claude            Joan             0
       Claude             Lee             0
       Claude           Pablo            10
       Claude         Vincent             0
        Frida         Georgia             0
        Frida            Joan             0
        Frida             Lee             0
        Frida           Pablo            50
        Frida         Vincent            60
      Georgia            Joan             0
      Georgia             Lee            10
      Georgia           Pablo             0
      Georgia         Vincent             0
         Joan             Lee            70
         Joan           Pablo             0
         Joan         Vincent            10
          Lee           Pablo             0
          Lee         Vincent             0
        Pablo         Vincent           -20
​

Second:

   first_employee target_employee  book_count
0         Vincent           Frida           2
1         Vincent           Pablo           1
2            Andy          Claude           1
3            Andy            Joan           1
4            Andy           Pablo           1
5            Andy             Lee           1
6            Andy           Frida           1
7            Andy         Georgia           1
8          Claude         Georgia           3
9            Joan             Lee           3
10          Pablo           Frida           2

JavaScript
 
   first_employee target_employee  book_count
       Vincent           Frida           2
       Vincent           Pablo           1
          Andy          Claude           1
          Andy            Joan           1
          Andy           Pablo           1
          Andy             Lee           1
          Andy           Frida           1
          Andy         Georgia           1
        Claude         Georgia           3
          Joan             Lee           3
        Pablo           Frida           2
​

I want to join the two dataframes such that my final dataframe is identical to the first one, but it has also the book_count column with the corresponding values (and NaN if not available).

I already wrote something like: joined_df = first_df.merge(second_df, on = ['first_employee', 'target_employee'], how = 'outer') and I get:

   first_employee target_employee  relationship  book_count
0            Andy          Claude           0.0         1.0
1            Andy           Frida          20.0         1.0
2            Andy         Georgia         -10.0         1.0
3            Andy            Joan          30.0         1.0
4            Andy             Lee         -10.0         1.0
5            Andy           Pablo         -10.0         1.0
6            Andy         Vincent          20.0         NaN
7          Claude           Frida           0.0         NaN
8          Claude         Georgia          90.0         3.0
9          Claude            Joan           0.0         NaN
10         Claude             Lee           0.0         NaN
11         Claude           Pablo          10.0         NaN
12         Claude         Vincent           0.0         NaN
13          Frida         Georgia           0.0         NaN
14          Frida            Joan           0.0         NaN
15          Frida             Lee           0.0         NaN
16          Frida           Pablo          50.0         NaN
17          Frida         Vincent          60.0         NaN
18        Georgia            Joan           0.0         NaN
19        Georgia             Lee          10.0         NaN
20        Georgia           Pablo           0.0         NaN
21        Georgia         Vincent           0.0         NaN
22           Joan             Lee          70.0         3.0
23           Joan           Pablo           0.0         NaN
24           Joan         Vincent          10.0         NaN
25            Lee           Pablo           0.0         NaN
26            Lee         Vincent           0.0         NaN
27          Pablo         Vincent         -20.0         NaN
28        Vincent           Frida           NaN         2.0
29        Vincent           Pablo           NaN         1.0
30          Pablo           Frida           NaN         2.0

JavaScript
 
   first_employee target_employee  relationship  book_count
          Andy          Claude           0.0         1.0
          Andy           Frida          20.0         1.0
          Andy         Georgia         -10.0         1.0
          Andy            Joan          30.0         1.0
          Andy             Lee         -10.0         1.0
          Andy           Pablo         -10.0         1.0
          Andy         Vincent          20.0         NaN
        Claude           Frida           0.0         NaN
        Claude         Georgia          90.0         3.0
        Claude            Joan           0.0         NaN
       Claude             Lee           0.0         NaN
       Claude           Pablo          10.0         NaN
       Claude         Vincent           0.0         NaN
        Frida         Georgia           0.0         NaN
        Frida            Joan           0.0         NaN
        Frida             Lee           0.0         NaN
        Frida           Pablo          50.0         NaN
        Frida         Vincent          60.0         NaN
      Georgia            Joan           0.0         NaN
      Georgia             Lee          10.0         NaN
      Georgia           Pablo           0.0         NaN
      Georgia         Vincent           0.0         NaN
         Joan             Lee          70.0         3.0
         Joan           Pablo           0.0         NaN
         Joan         Vincent          10.0         NaN
          Lee           Pablo           0.0         NaN
          Lee         Vincent           0.0         NaN
        Pablo         Vincent         -20.0         NaN
      Vincent           Frida           NaN         2.0
      Vincent           Pablo           NaN         1.0
        Pablo           Frida           NaN         2.0
​

And it is somewhat close to what I would like to achieve. However, the ordering of the values in the first_employee and target_employee it’s not relevant, so if in the first dataframe I have (Frida,Vincent) and in the second (Vincent, Frida), these twos should be merged together (what matters are the values, not the column-wise order).

In my resulting dataframe i get three extra rows:

28        Vincent           Frida           NaN         2.0
29        Vincent           Pablo           NaN         1.0
30          Pablo           Frida           NaN         2.0

JavaScript
 
28        Vincent           Frida           NaN         2.0
29        Vincent           Pablo           NaN         1.0
30          Pablo           Frida           NaN         2.0
​

which are the result of my merging that considers “ordered” values columns-wise to make the join: these 3 extra rows should be merged on the already available couples (Frida, Vincent) (Pablo, Vincent) and (Frida, Pablo).

Is there a way to do so using only standard pandas functions? (the question I cited at the beginning uses sqldf)

Answer

I believe this is what you are looking for. Using np.sort will change the order of the first two columns for each row so it is alphabetical, allowing the merge to work correctly.

cols = ['first_employee','target_employee']
df[cols] = np.sort(df[cols].to_numpy(),axis=1)
df2[cols] = np.sort(df2[cols].to_numpy(),axis=1)
ndf = pd.merge(df,df2,on = cols,how='left')

JavaScript
 
cols = ['first_employee','target_employee']
df[cols] = np.sort(df[cols].to_numpy(),axis=1)
df2[cols] = np.sort(df2[cols].to_numpy(),axis=1)
ndf = pd.merge(df,df2,on = cols,how='left')
​

Advertisement

Answer