Skip to content
Advertisement

Compare two dataframe column values and join with condition in python?

I need to join the below dataframe based on some condition.

df1:
    Id             Value
    [101,102,103]   10001
    [101,102,104]   10000
    [101,102,105]   10002
    [101,107,105]   10003

df2:

    Id                 Product_Name
    [101,102,103,104]  Shoe
    [101,102,109,104]  jeans
    [101,105,102,108]  make-up
    [101,105,106,118]  shirt

df_output

Id             Value    Product_Name
[101,102,103]   10001   Shoe           -- Every value present in df2.Id list
[101,102,104]   10000   Jeans          -- Every value present in df2.Id list
[101,102,105]   10002   Make-Up        -- Every value present in df2.Id list
[101,107,105]   10003   NaN            -- Not Every Value matches in df2.Id list.

I need to join two dataframe df1, df2 based on Id column but every element should be in df.Id list that’s when we consider it a match.

Python Code:

import pandas as pd

df_output = pd.merge(df1, df2, on='Id', how='left')

Advertisement

Answer

While this isn’t a highly efficient solution, you can use some sets to solve this problem.

matches = df1["Id"].apply(set) <= df2["Id"].apply(set)

out = df1.copy()
out.loc[matches, df2.columns.difference(["Id"])] = df2

print(out)
                Id  Value Product_Name
0  [101, 102, 103]  10001         Shoe
1  [101, 102, 104]  10000        jeans
2  [101, 102, 105]  10002      make-up
3  [101, 107, 105]  10003          NaN

In the above snippet:

  1. matches = df1["Id"].apply(set) <= df2["Id"].apply(set) returns a boolean Series that is True where the contents of each row in df1[‘Id’] is in the corresponding row in df2[‘Id’], and False otherwise
  2. Instead of performing an actual merge we can simply align the 2 DataFrames on the aforementioned boolean Series

If you want to test Ids against eachother in both dataframes, you can take the cartesian product of both DataFrames, filter it down to the inner join via the set criteria, and then append back any missing left join keys.

out = (
    pd.merge(df1, df2, how="cross")
    .loc[lambda df: df["Id_x"].map(set) <= df["Id_y"].map(set)]
    .pipe(
        lambda df: df.append(
             df1.loc[~df1["Id"].isin(df["Id_x"])].rename(columns={"Id": "Id_x"})
         )
    )
    .reset_index(drop=True)
)


print(out)
              Id_x  Value                  Id_y Product_Name
0  [101, 102, 103]  10001  [101, 102, 103, 104]         Shoe
1  [101, 102, 104]  10000  [101, 102, 103, 104]         Shoe
2  [101, 102, 104]  10000  [101, 102, 109, 104]        jeans
3  [101, 102, 105]  10002  [101, 105, 102, 108]      make-up
4  [101, 107, 105]  10003                   NaN          NaN

Advertisement