
Test whether rows of one dataframe are in another

I have a PySpark dataframe, df:

A    B     C
E00  FT   AS
E01  FG   AD
E02  FF   AB
E03  FH   AW
E04  FF   AQ
E05  FV   AR   
E06  FD   AE

and a second, smaller PySpark dataframe, df2, three of whose rows also appear in df:

A    B    C
Null QW   Null
QA   Null Null
E03  FH   AW
E04  FF   AQ
E05  FV   AR
Null Null Null

Is there a way in PySpark to create a third, boolean dataframe indicating which rows of df appear in df2? Such as:

  A     B      C
False  False  False
False  False  False
False  False  False
True   True   True
True   True   True
True   True   True
False  False  False

Many thanks in advance.


Answer

You can do a left join on equality of all columns and assign False where every column joined from df2 is null (below, df1 refers to the question's df):

import pyspark.sql.functions as F

# Join each df1 row to a df2 row where the columns match.
# `least` over the per-column equality flags is True only when all
# comparisons are True (note that least/greatest skip nulls).
result = df1.alias('df1').join(
    df2.alias('df2'),
    F.least(*[F.expr(f'df1.{c} = df2.{c}') for c in df1.columns]),
    'left'
).select(
    # After a left join, every df2 column is null for unmatched rows;
    # `greatest(...isNull)` is True if any df2 column is null, so its
    # negation marks the matched rows.
    [
        (~F.greatest(*[F.col(f'df2.{c2}').isNull() for c2 in df2.columns])).alias(c)
        for c in df1.columns
    ]
)

result.show()
+-----+-----+-----+
|    A|    B|    C|
+-----+-----+-----+
|false|false|false|
|false|false|false|
|false|false|false|
| true| true| true|
| true| true| true|
| true| true| true|
|false|false|false|
+-----+-----+-----+
User contributions licensed under: CC BY-SA