Condition:
- If an ID has a Score 'High' or 'Mid' -> remove its None rows
- If an ID only has Score None -> keep the None row
Input:
| ID | Score |
|---|---|
| AAA | High |
| AAA | Mid |
| AAA | None |
| BBB | None |
Desired output:
| ID | Score |
|---|---|
| AAA | High |
| AAA | Mid |
| BBB | None |
I'm having difficulty writing the if condition in PySpark. Is there another way to tackle this problem?
Answer
You can count Score over a window partitioned by ID, then keep rows where Score is not null or the count is 0 (a count of 0 means the ID has no non-null Score at all, since count ignores NULLs):
from pyspark.sql import Window
from pyspark.sql import functions as F
df1 = (
    df.withColumn(
        "count_scores",
        # count() skips NULLs, so count_scores = 0 means the ID only has None
        F.count("Score").over(Window.partitionBy("ID"))
    )
    .where("Score IS NOT NULL OR count_scores = 0")
    .drop("count_scores")
)
df1.show()
#+---+-----+
#| ID|Score|
#+---+-----+
#|BBB| null|
#|AAA| High|
#|AAA| Mid|
#+---+-----+