I’m new to Pyspark and I have a problem to solve.
I have a dataframe with 4 columns, being customers
, person
, is_online_store
and count
:
customer | PersonId | is_online_store | count |
---|---|---|---|
afabd2d2 | 4 | true | 1 |
afabd2d2 | 8 | true | 2 |
afabd2d2 | 3 | true | 1 |
afabd2d2 | 2 | false | 1 |
afabd2d2 | 4 | false | 1 |
I need to create according to the following rules:
- If
PersonId
count(column) has 1 or + Trues and None False thenOnline
- If
PersonId
count(column) has 1 or + False and None True thenoffline
- If the
PersonId
count(column) has at least 1 False AND 1 True thenHybrid
How it should look:
customer | PersonId | New_Column |
---|---|---|
afabd2d2 | 4 | Hybrid |
afabd2d2 | 8 | Online |
afabd2d2 | 3 | Online |
afabd2d2 | 2 | Offline |
Advertisement
Answer
Group by customer
+ personId
and use when
expression to check if all values in is_online_store
column are true
/ false
or a mix of the 2, using for example bool_and
function:
from pyspark.sql import functions as F df1 = df.groupBy("customer", "PersonId").agg( F.when(F.expr("bool_and(is_online_store)"), "Online") .when(F.expr("bool_and(!is_online_store)"), "Offline") .otherwise("Hybrid").alias("New_Column") ) df1.show() #+--------+--------+----------+ #|customer|PersonId|New_Column| #+--------+--------+----------+ #|afabd2d2| 2| Offline| #|afabd2d2| 8| Online| #|afabd2d2| 4| Hybrid| #|afabd2d2| 3| Online| #+--------+--------+----------+