I have a pyspark dataframe df:
A    B    C
E00  FT   AS
E01  FG   AD
E02  FF   AB
E03  FH   AW
E04  FF   AQ
E05  FV   AR
E06  FD   AE
and another, smaller pyspark dataframe, df2, in which 3 rows have the same values as rows of df:
A     B     C
Null  QW    Null
QA    Null  Null
E03   FH    AW
E04   FF    AQ
E05   FV    AR
Null  Null  Null
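(The question does not show how the two dataframes were built; the following is a minimal setup sketch, assuming an active SparkSession called spark and that the "Null" entries in df2 are real nulls, i.e. None.)

from pyspark.sql import SparkSession

# assumption: a local/in-session SparkSession; in most environments one already exists
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [('E00', 'FT', 'AS'), ('E01', 'FG', 'AD'), ('E02', 'FF', 'AB'),
     ('E03', 'FH', 'AW'), ('E04', 'FF', 'AQ'), ('E05', 'FV', 'AR'),
     ('E06', 'FD', 'AE')],
    'A string, B string, C string',
)

# the "Null" values are assumed to be genuine nulls (None)
df2 = spark.createDataFrame(
    [(None, 'QW', None), ('QA', None, None), ('E03', 'FH', 'AW'),
     ('E04', 'FF', 'AQ'), ('E05', 'FV', 'AR'), (None, None, None)],
    'A string, B string, C string',
)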
Is there a way in pyspark to create a third, boolean dataframe indicating which rows of df also appear in df2? Such as:
A      B      C
False  False  False
False  False  False
False  False  False
True   True   True
True   True   True
True   True   True
False  False  False
Many thanks in advance.
Answer
You can do a left join on all columns (df from the question is referred to as df1 below) and assign False to a row when every column joined from df2 is null, i.e. when no matching row was found:
import pyspark.sql.functions as F

result = df1.alias('df1').join(
    df2.alias('df2'),
    # join condition: each column of df1 equals the corresponding column of df2
    F.least(*[F.expr(f'df1.{c} = df2.{c}') for c in df1.columns]),
    'left'
).select(
    # rows of df1 with no match in df2 have every joined df2 column null,
    # so a row is flagged True only when none of the df2 columns is null
    [
        (~F.greatest(*[F.col(f'df2.{c2}').isNull() for c2 in df2.columns])).alias(c)
        for c in df1.columns
    ]
)

result.show()

+-----+-----+-----+
|    A|    B|    C|
+-----+-----+-----+
|false|false|false|
|false|false|false|
|false|false|false|
| true| true| true|
| true| true| true|
| true| true| true|
|false|false|false|
+-----+-----+-----+