How to write a universal function to join two PySpark dataframes?
I want to write a function that performs an inner join on two dataframes and also eliminates the repeated common columns after joining. As far as I'm aware, there is no way to do that, since we always need to specify the common columns manually when joining. Or is there a way?
Answer
If you need to include all the common columns in the join condition, you can extract them into a list and pass that list to join(). After the join, call drop() on the same columns to eliminate them from the result.
common_cols = list(set(df.columns).intersection(set(df2.columns)))
df3 = df.join(df2, common_cols, how='inner').drop(*common_cols)
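For reference, here is a minimal sketch of how this could be wrapped into the kind of reusable function the question asks for. The helper name inner_join_on_common_cols and the sample dataframes are assumptions added for illustration, not part of the original answer.

# A minimal sketch, assuming a helper wrapping the approach above.
# The function name and the sample data are hypothetical.
from pyspark.sql import DataFrame, SparkSession

def inner_join_on_common_cols(left: DataFrame, right: DataFrame) -> DataFrame:
    # Columns present in both dataframes become the join condition.
    common_cols = list(set(left.columns).intersection(right.columns))
    # Joining on a list of column names keeps a single copy of each key;
    # the drop() then removes those key columns from the result entirely.
    return left.join(right, common_cols, how='inner').drop(*common_cols)

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])
    df2 = spark.createDataFrame([(1, 10.0), (3, 30.0)], ["id", "score"])
    inner_join_on_common_cols(df, df2).show()
    # Only the row with id=1 matches; "id" is dropped, leaving name and score.

Note that when you join on a list of column names (rather than an expression like df.id == df2.id), Spark already keeps a single copy of each join column, so the trailing drop() is only needed if you want the join keys removed from the output as well.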