PySpark: create all possible combinations of column values of a DataFrame

I want to get all possible combinations of size 2 of the values in a column of a PySpark DataFrame. My PySpark DataFrame looks like this:

One way would be to collect the values into a Python iterable (a list or a pandas DataFrame) and use itertools.combinations to generate all combinations.
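The original snippet is not available here, so the following is only a minimal sketch of that driver-side approach. The list `values` is a hypothetical stand-in for the collected column; with a real DataFrame it would come from something like `[row["id"] for row in df.select("id").collect()]`, where the column name `id` is an assumption.

```python
from itertools import combinations

# Hypothetical stand-in for the collected column values.
# With Spark this list would be built on the driver via collect().
values = [1, 2, 3, 4]

# combinations(values, 2) yields each unordered pair exactly once,
# in the order the values appear in the input.
pairs = list(combinations(values, 2))

print(pairs)
```

For n values this produces n * (n - 1) / 2 pairs, all held in driver memory, which is the cost the question is trying to avoid.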

However, I want to avoid collecting the DataFrame column to the driver, since the number of rows can be extremely large. Is there a better way to achieve this using Spark APIs?

Answer

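You can use the crossJoin method to join the column with itself, and then cull the rows with id1 >= id2. Keeping only id1 < id2 drops both the self-pairs and the mirrored duplicates, so each combination appears exactly once (assuming the column values are distinct).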
