I want to get all the possible combinations of size 2 of a column in pyspark dataframe. My pyspark dataframe looks like
| id |
| 1  |
| 2  |
| 3  |
| 4  |
For the above input, I want to get output as

| id1 | id2 |
| 1   | 2   |
| 1   | 3   |
| 1   | 4   |
| 2   | 3   |

and so on..
One way would be to collect the values into a Python iterable (a list or a pandas DataFrame) and use itertools.combinations
to generate all the combinations.
import itertools
from pyspark.sql import functions as F

# Pull the whole column to the driver, then enumerate size-2 combinations
values = df.select(F.collect_list('id')).first()[0]
combns = list(itertools.combinations(values, 2))
However, I want to avoid collecting the dataframe column to the driver, since the number of rows can be extremely large. Is there a better way to achieve this using Spark APIs?
Answer
You can use the crossJoin
method, and then keep only the rows where id1 < id2
(this drops both the self-pairs and the duplicate reversed orderings).
# Cartesian product of the column with itself, keeping one ordering per pair
df = df.toDF('id1').crossJoin(df.toDF('id2')).filter('id1 < id2')
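The keep-one-ordering idea behind the filter can be sanity-checked without a Spark session in plain Python: itertools.product stands in for the cross join, and the ids list below is a made-up stand-in for the dataframe column.

```python
import itertools

ids = [1, 2, 3, 4]  # toy stand-in for the 'id' column

# Cross join = Cartesian product; keeping only pairs with a < b
# leaves exactly one ordering of each distinct pair.
pairs = [(a, b) for a, b in itertools.product(ids, ids) if a < b]

# This matches itertools.combinations(ids, 2) exactly
assert pairs == list(itertools.combinations(ids, 2))
print(pairs)  # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

The same equivalence is why the crossJoin answer returns the result the question asks for, just computed on the executors instead of the driver.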