How to write a universal function to join two PySpark dataframes?


I want to write a function that performs an inner join on two dataframes and also eliminates the repeated common columns after joining. As far as I’m aware there is no way to do that, since we always need to define the common columns manually while joining. Or is there a way?

Answer

If you need to include all the common columns in the join condition, you can extract them into a list and pass it to join(). PySpark keeps only one copy of each of these join columns in the output, so nothing is repeated. If you also want the join columns removed from the result entirely, call drop() on the same list after the join.

# Columns that appear in both dataframes
common_cols = list(set(df.columns).intersection(set(df2.columns)))

# Inner equi-join on the common columns, then drop them from the result
df3 = df.join(df2, common_cols, how='inner').drop(*common_cols)
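If you want to wrap this into a reusable helper (the "universal function" asked about), a minimal sketch could look like the one below. The name join_on_common_cols and the keep_join_cols flag are illustrative choices, not part of any PySpark API; it assumes both inputs are pyspark.sql.DataFrame objects.

from pyspark.sql import DataFrame

def join_on_common_cols(left: DataFrame, right: DataFrame, how: str = 'inner',
                        keep_join_cols: bool = True) -> DataFrame:
    """Join two dataframes on every column they have in common.

    Each common column appears only once in the result; pass
    keep_join_cols=False to drop the join columns entirely.
    """
    # Columns shared by both dataframes (helper name and flag are hypothetical)
    common_cols = list(set(left.columns).intersection(right.columns))
    if not common_cols:
        raise ValueError("The dataframes have no columns in common to join on.")
    joined = left.join(right, on=common_cols, how=how)
    return joined if keep_join_cols else joined.drop(*common_cols)

# Example usage (assuming df and df2 already exist):
# df3 = join_on_common_cols(df, df2, how='inner', keep_join_cols=False)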