Condition:
- If an ID has a Score of ‘High’ or ‘Mid’ -> remove its None rows
- If an ID only has a Score of None -> keep the None row
Input:
| ID  | Score |
|-----|-------|
| AAA | High  |
| AAA | Mid   |
| AAA | None  |
| BBB | None  |
Desired output:
| ID  | Score |
|-----|-------|
| AAA | High  |
| AAA | Mid   |
| BBB | None  |
I’m having difficulty in writing the if condition in PySpark. Is there any other way to tackle this problem?
Answer
You can count Score over a window partitioned by ID, then keep the rows where Score is not null or where the count is 0:
from pyspark.sql import Window
from pyspark.sql import functions as F

# F.count("Score") ignores nulls, so count_scores is the number of non-null
# scores per ID; keep non-null rows, plus the rows of IDs with no non-null score
df1 = (
    df.withColumn("count_scores", F.count("Score").over(Window.partitionBy("ID")))
      .where("Score IS NOT NULL OR count_scores = 0")
      .drop("count_scores")
)

df1.show()
#+---+-----+
#| ID|Score|
#+---+-----+
#|BBB| null|
#|AAA| High|
#|AAA| Mid|
#+---+-----+
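To try this end to end, here is a minimal, self-contained sketch that builds the sample DataFrame and applies the same filter. It assumes the None entries in the question are genuine nulls rather than the literal string "None":

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# sample data from the question; Score is assumed to hold real nulls
df = spark.createDataFrame(
    [("AAA", "High"), ("AAA", "Mid"), ("AAA", None), ("BBB", None)],
    ["ID", "Score"],
)

# same logic as above: count non-null scores per ID, then filter
df1 = (
    df.withColumn("count_scores", F.count("Score").over(Window.partitionBy("ID")))
      .where("Score IS NOT NULL OR count_scores = 0")
      .drop("count_scores")
)
df1.show()

If Score actually stores the string "None" instead of a null, convert it to a real null first, for example with F.when(F.col("Score") == "None", F.lit(None)).otherwise(F.col("Score")), so that F.count and IS NOT NULL behave as expected.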