I have a pyspark dataframe df:
JavaScript
x
9
1
A B C
2
E00 FT AS
3
E01 FG AD
4
E02 FF AB
5
E03 FH AW
6
E04 FF AQ
7
E05 FV AR
8
E06 FD AE
9
and another smaller pyspark dataframe but with 3 rows with the same values, df2:
JavaScript
1
8
1
A B C
2
Null QW Null
3
QA Null Null
4
E03 FH AW
5
E04 FF AQ
6
E05 FV AR
7
Null Null Null
8
Is there a way in pyspark to create a third boolean dataframe from the rows in df2 are in df? Such as:
JavaScript
1
9
1
A B C
2
False False False
3
False False False
4
False False False
5
True True True
6
True True True
7
True True True
8
False False False
9
Many thanks in advance.
Advertisement
Answer
You can do a left join and assign False
if all columns joined from df2
are null:
JavaScript
1
26
26
1
import pyspark.sql.functions as F
2
3
result = df1.alias('df1').join(
4
df2.alias('df2'),
5
F.least(*[F.expr(f'df1.{c} = df2.{c}') for c in df1.columns]),
6
'left'
7
).select(
8
[
9
(~F.greatest(*[F.col(f'df2.{c2}').isNull() for c2 in df2.columns])).alias(c)
10
for c in df1.columns
11
]
12
)
13
14
result.show()
15
+-----+-----+-----+
16
| A| B| C|
17
+-----+-----+-----+
18
|false|false|false|
19
|false|false|false|
20
|false|false|false|
21
| true| true| true|
22
| true| true| true|
23
| true| true| true|
24
|false|false|false|
25
+-----+-----+-----+
26