I have a pyspark dataframe df:
A    B    C
E00  FT   AS
E01  FG   AD
E02  FF   AB
E03  FH   AW
E04  FF   AQ
E05  FV   AR
E06  FD   AE
and another, smaller pyspark dataframe, df2, 3 of whose rows have the same values as rows in df:
A     B     C
Null  QW    Null
QA    Null  Null
E03   FH    AW
E04   FF    AQ
E05   FV    AR
Null  Null  Null
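For reference, the two dataframes can be created like this (a minimal sketch, assuming all three columns are strings and that Null stands for an actual null value):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# df: the full dataframe from the first table above
df = spark.createDataFrame(
    [('E00', 'FT', 'AS'), ('E01', 'FG', 'AD'), ('E02', 'FF', 'AB'),
     ('E03', 'FH', 'AW'), ('E04', 'FF', 'AQ'), ('E05', 'FV', 'AR'),
     ('E06', 'FD', 'AE')],
    'A string, B string, C string',
)

# df2: the smaller dataframe, with None where the table shows Null
df2 = spark.createDataFrame(
    [(None, 'QW', None), ('QA', None, None), ('E03', 'FH', 'AW'),
     ('E04', 'FF', 'AQ'), ('E05', 'FV', 'AR'), (None, None, None)],
    'A string, B string, C string',
)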
Is there a way in pyspark to create a third, boolean dataframe that flags which rows of df also appear in df2? Such as:
A      B      C
False  False  False
False  False  False
False  False  False
True   True   True
True   True   True
True   True   True
False  False  False
Many thanks in advance.
Answer
You can do a left join on all the columns and assign False to rows where the columns joined from df2 come back null (i.e. the row has no full match in df2):
import pyspark.sql.functions as F

result = df.alias('df1').join(
    df2.alias('df2'),
    # F.least over the equality tests is true only when no per-column comparison is false
    F.least(*[F.expr(f'df1.{c} = df2.{c}') for c in df.columns]),
    'left'
).select(
    [
        # unmatched rows (and rows that only joined because of nulls on the df2 side)
        # have at least one null df2 column, so ~greatest(isNull, ...) marks them False
        (~F.greatest(*[F.col(f'df2.{c2}').isNull() for c2 in df2.columns])).alias(c)
        for c in df.columns
    ]
)
result.show()
+-----+-----+-----+
| A| B| C|
+-----+-----+-----+
|false|false|false|
|false|false|false|
|false|false|false|
| true| true| true|
| true| true| true|
| true| true| true|
|false|false|false|
+-----+-----+-----+
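If the least/greatest trick reads as opaque: F.least over the equality tests is true only when no per-column comparison is false, and F.greatest over the isNull checks is true when any df2 column came back null, so negating it marks exactly the fully matched rows. The same idea can be written more explicitly by AND-ing the comparisons with functools.reduce and then testing a single df2 column; a minimal alternative sketch (not the code above), assuming the df and df2 from the question:

from functools import reduce
import pyspark.sql.functions as F

# Strict AND of the per-column equality tests: rows with nulls on either side never match.
match_cond = reduce(
    lambda a, b: a & b,
    [F.col(f'df1.{c}') == F.col(f'df2.{c}') for c in df.columns],
)

result = df.alias('df1').join(df2.alias('df2'), match_cond, 'left').select(
    # After a strict equality join, matched rows have no nulls on the df2 side,
    # so checking a single df2 column is enough to flag the match.
    [F.col(f'df2.{df.columns[0]}').isNotNull().alias(c) for c in df.columns]
)
result.show()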