
How to apply a condition in PySpark to keep a null only if it is the only value, else remove nulls

Condition:

  • If an ID has a Score of ‘High’ or ‘Mid’ -> remove the None rows
  • If an ID only has a Score of None -> keep the None row

Input:

ID Score
AAA High
AAA Mid
AAA None
BBB None

Desired output:

ID Score
AAA High
AAA Mid
BBB None

I’m having difficulty writing the if condition in PySpark. Is there any other way to tackle this problem?


Answer
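For reference, here is a minimal sketch to reproduce the input from the question (it assumes a local SparkSession; df is the name used in the code below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; None represents the null scores
df = spark.createDataFrame(
    [("AAA", "High"), ("AAA", "Mid"), ("AAA", None), ("BBB", None)],
    ["ID", "Score"],
)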

You can count Score over a window partitioned by ID, then filter on rows where Score is not null or the count is 0. Since count ignores nulls, the count is 0 only when every Score for that ID is null:

from pyspark.sql import Window
from pyspark.sql import functions as F

df1 = (
    df.withColumn(
        "count_scores",
        # count() skips nulls, so this is 0 only when the ID has no non-null Score
        F.count("Score").over(Window.partitionBy("ID"))
    )
    .where("Score IS NOT NULL OR count_scores = 0")
    .drop("count_scores")
)

df1.show()

#+---+-----+
#| ID|Score|
#+---+-----+
#|BBB| null|
#|AAA| High|
#|AAA|  Mid|
#+---+-----+
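
An equivalent alternative (a sketch, not from the original answer) uses F.max over the same window: max also ignores nulls, so it is null only when every Score for the ID is null.

w = Window.partitionBy("ID")

df2 = (
    df.withColumn("max_score", F.max("Score").over(w))
      # keep non-null scores, plus rows whose ID has no non-null score at all
      .where("Score IS NOT NULL OR max_score IS NULL")
      .drop("max_score")
)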