How to write a universal function to join two PySpark dataframes?


I want to write a function that performs an inner join on two dataframes and also eliminates the repeated common columns after joining. As far as I’m aware there is no way to do that, since we always need to define the common columns manually while joining. Or is there a way?

Answer

If you need to include all the common columns in the join condition, you can extract them into a list and pass it to join(). PySpark keeps only one copy of each of these join columns in the output, so nothing is repeated. If you also want the join columns removed from the result entirely, call drop() on the same list after the join.

# Columns that appear in both dataframes
common_cols = list(set(df.columns).intersection(set(df2.columns)))

# Inner equi-join on the common columns, then drop them from the result
df3 = df.join(df2, common_cols, how='inner').drop(*common_cols)
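If you want to wrap this into a reusable helper (the "universal function" asked about), a minimal sketch could look like the one below. The name join_on_common_cols and the keep_join_cols flag are illustrative choices, not part of any PySpark API; it assumes both inputs are pyspark.sql.DataFrame objects.

from pyspark.sql import DataFrame

def join_on_common_cols(left: DataFrame, right: DataFrame, how: str = 'inner',
                        keep_join_cols: bool = True) -> DataFrame:
    """Join two dataframes on every column they have in common.

    Each common column appears only once in the result; pass
    keep_join_cols=False to drop the join columns entirely.
    """
    # Columns shared by both dataframes (helper name and flag are hypothetical)
    common_cols = list(set(left.columns).intersection(right.columns))
    if not common_cols:
        raise ValueError("The dataframes have no columns in common to join on.")
    joined = left.join(right, on=common_cols, how=how)
    return joined if keep_join_cols else joined.drop(*common_cols)

# Example usage (assuming df and df2 already exist):
# df3 = join_on_common_cols(df, df2, how='inner', keep_join_cols=False)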