
How to apply a condition in PySpark to keep a null only if it is the only value, else remove nulls

Condition:

  • If an ID has a Score of ‘High’ or ‘Mid’ -> remove the None rows
  • If an ID only has a Score of None -> keep the None row

Input:

ID Score
AAA High
AAA Mid
AAA None
BBB None

Desired output:

ID Score
AAA High
AAA Mid
BBB None

I’m having difficulty writing the if condition in PySpark. Is there any other way to tackle this problem?


Answer
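For reference, here is a minimal sketch to reproduce the input from the question (it assumes a local SparkSession; df is the name used in the code below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; None represents the null scores
df = spark.createDataFrame(
    [("AAA", "High"), ("AAA", "Mid"), ("AAA", None), ("BBB", None)],
    ["ID", "Score"],
)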

You can count Score over a window partitioned by ID, then filter on rows where Score is not null or the count is 0. Since count ignores nulls, the count is 0 only when every Score for that ID is null:

from pyspark.sql import Window
from pyspark.sql import functions as F

df1 = (
    df.withColumn(
        "count_scores",
        # count() skips nulls, so this is 0 only when the ID has no non-null Score
        F.count("Score").over(Window.partitionBy("ID"))
    )
    .where("Score IS NOT NULL OR count_scores = 0")
    .drop("count_scores")
)

df1.show()

#+---+-----+
#| ID|Score|
#+---+-----+
#|BBB| null|
#|AAA| High|
#|AAA|  Mid|
#+---+-----+
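
An equivalent alternative (a sketch, not from the original answer) uses F.max over the same window: max also ignores nulls, so it is null only when every Score for the ID is null.

w = Window.partitionBy("ID")

df2 = (
    df.withColumn("max_score", F.max("Score").over(w))
      # keep non-null scores, plus rows whose ID has no non-null score at all
      .where("Score IS NOT NULL OR max_score IS NULL")
      .drop("max_score")
)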