Condition:
- If an ID has a Score of 'High' or 'Mid' -> remove its None row
- If an ID only has Score None -> keep the None row
Input:
| ID  | Score |
|-----|-------|
| AAA | High  |
| AAA | Mid   |
| AAA | None  |
| BBB | None  |
Desired output:
| ID  | Score |
|-----|-------|
| AAA | High  |
| AAA | Mid   |
| BBB | None  |
I'm having difficulty writing the if condition in PySpark. Is there another way to tackle this problem?
Answer
You can count `Score` over a window partitioned by `ID`, then filter for rows where `Score` is not null or the count is 0:
```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# F.count("Score") counts only non-null values, so count_scores is the
# number of non-null scores each ID has.
df1 = df.withColumn(
    "count_scores",
    F.count("Score").over(Window.partitionBy("ID"))
).where("Score IS NOT NULL OR count_scores = 0").drop("count_scores")

df1.show()
#+---+-----+
#| ID|Score|
#+---+-----+
#|BBB| null|
#|AAA| High|
#|AAA| Mid|
#+---+-----+
```
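To see what the window count is doing, the same keep/drop rule can be sketched in plain Python (no Spark needed) — a minimal stand-in, not part of the Spark answer, using a hypothetical `rows` list that mirrors the sample data:

```python
from collections import Counter

# Sample rows mirroring the question's DataFrame; None plays the role of null.
rows = [("AAA", "High"), ("AAA", "Mid"), ("AAA", None), ("BBB", None)]

# Mirrors F.count("Score").over(Window.partitionBy("ID")):
# count only the non-null scores per ID.
count_scores = Counter(id_ for id_, score in rows if score is not None)

# Keep a row if its score is non-null, or if its ID has no non-null scores.
result = [(id_, score) for id_, score in rows
          if score is not None or count_scores[id_] == 0]

print(result)  # [('AAA', 'High'), ('AAA', 'Mid'), ('BBB', None)]
```

The key point in both versions is that the count skips nulls, so an ID whose only score is null gets a count of 0 and its null row survives the filter.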