I’m new to Pyspark and I have a problem to solve.
I have a dataframe with 4 columns, being customers
, person
, is_online_store
and count
:
customer | PersonId | is_online_store | count |
---|---|---|---|
afabd2d2 | 4 | true | 1 |
afabd2d2 | 8 | true | 2 |
afabd2d2 | 3 | true | 1 |
afabd2d2 | 2 | false | 1 |
afabd2d2 | 4 | false | 1 |
I need to create according to the following rules:
- If
PersonId
count(column) has 1 or + Trues and None False thenOnline
- If
PersonId
count(column) has 1 or + False and None True thenoffline
- If the
PersonId
count(column) has at least 1 False AND 1 True thenHybrid
How it should look:
customer | PersonId | New_Column |
---|---|---|
afabd2d2 | 4 | Hybrid |
afabd2d2 | 8 | Online |
afabd2d2 | 3 | Online |
afabd2d2 | 2 | Offline |
Advertisement
Answer
Group by customer
+ personId
and use when
expression to check if all values in is_online_store
column are true
/ false
or a mix of the 2, using for example bool_and
function:
JavaScript
x
18
18
1
from pyspark.sql import functions as F
2
3
df1 = df.groupBy("customer", "PersonId").agg(
4
F.when(F.expr("bool_and(is_online_store)"), "Online")
5
.when(F.expr("bool_and(!is_online_store)"), "Offline")
6
.otherwise("Hybrid").alias("New_Column")
7
)
8
9
df1.show()
10
#+--------+--------+----------+
11
#|customer|PersonId|New_Column|
12
#+--------+--------+----------+
13
#|afabd2d2| 2| Offline|
14
#|afabd2d2| 8| Online|
15
#|afabd2d2| 4| Hybrid|
16
#|afabd2d2| 3| Online|
17
#+--------+--------+----------+
18