I’m new to Pyspark and I have a problem to solve.
I have a dataframe with 4 columns, being customers, person, is_online_store and count:
| customer | PersonId | is_online_store | count |
|---|---|---|---|
| afabd2d2 | 4 | true | 1 |
| afabd2d2 | 8 | true | 2 |
| afabd2d2 | 3 | true | 1 |
| afabd2d2 | 2 | false | 1 |
| afabd2d2 | 4 | false | 1 |
I need to create according to the following rules:
- If
PersonIdcount(column) has 1 or + Trues and None False thenOnline - If
PersonIdcount(column) has 1 or + False and None True thenoffline - If the
PersonIdcount(column) has at least 1 False AND 1 True thenHybrid
How it should look:
| customer | PersonId | New_Column |
|---|---|---|
| afabd2d2 | 4 | Hybrid |
| afabd2d2 | 8 | Online |
| afabd2d2 | 3 | Online |
| afabd2d2 | 2 | Offline |
Advertisement
Answer
Group by customer + personId and use when expression to check if all values in is_online_store column are true / false or a mix of the 2, using for example bool_and function:
from pyspark.sql import functions as F
df1 = df.groupBy("customer", "PersonId").agg(
F.when(F.expr("bool_and(is_online_store)"), "Online")
.when(F.expr("bool_and(!is_online_store)"), "Offline")
.otherwise("Hybrid").alias("New_Column")
)
df1.show()
#+--------+--------+----------+
#|customer|PersonId|New_Column|
#+--------+--------+----------+
#|afabd2d2| 2| Offline|
#|afabd2d2| 8| Online|
#|afabd2d2| 4| Hybrid|
#|afabd2d2| 3| Online|
#+--------+--------+----------+