
Logical check with count in PySpark

I'm new to PySpark and I have a problem to solve.

I have a DataFrame with 4 columns: customer, PersonId, is_online_store and count:

customer  PersonId  is_online_store  count
afabd2d2  4         true             1
afabd2d2  8         true             2
afabd2d2  3         true             1
afabd2d2  2         false            1
afabd2d2  4         false            1

I need to create a new column according to the following rules:

  • If a PersonId has one or more true values in is_online_store and no false values, then Online
  • If a PersonId has one or more false values and no true values, then Offline
  • If a PersonId has at least one false AND at least one true, then Hybrid

How it should look:

customer  PersonId  New_Column
afabd2d2  4         Hybrid
afabd2d2  8         Online
afabd2d2  3         Online
afabd2d2  2         Offline
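
For reference, here is a minimal way to reproduce the sample DataFrame above, assuming a SparkSession is already available (created here as spark):

from pyspark.sql import SparkSession

# Assumes a local SparkSession; adjust to your own environment.
spark = SparkSession.builder.getOrCreate()

# Sample data taken from the question.
df = spark.createDataFrame(
    [
        ("afabd2d2", 4, True, 1),
        ("afabd2d2", 8, True, 2),
        ("afabd2d2", 3, True, 1),
        ("afabd2d2", 2, False, 1),
        ("afabd2d2", 4, False, 1),
    ],
    ["customer", "PersonId", "is_online_store", "count"],
)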


Answer

Group by customer + PersonId and use a when expression to check whether all values in the is_online_store column are true, all are false, or a mix of the two, using for example the bool_and aggregate function:

from pyspark.sql import functions as F

df1 = df.groupBy("customer", "PersonId").agg(
    F.when(F.expr("bool_and(is_online_store)"), "Online")
     .when(F.expr("bool_and(!is_online_store)"), "Offline")
     .otherwise("Hybrid").alias("New_Column")
)

df1.show()
#+--------+--------+----------+
#|customer|PersonId|New_Column|
#+--------+--------+----------+
#|afabd2d2|       2|   Offline|
#|afabd2d2|       8|    Online|
#|afabd2d2|       4|    Hybrid|
#|afabd2d2|       3|    Online|
#+--------+--------+----------+
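
Note that bool_and is only available from Spark 3.0 onward. On older versions, one possible alternative (a sketch, not tested against every Spark release) is to rely on min/max over the boolean column, since false sorts before true:

from pyspark.sql import functions as F

# min == true  -> every row in the group is true   -> Online
# max == false -> every row in the group is false  -> Offline
# anything else is a mix of true and false         -> Hybrid
df1 = df.groupBy("customer", "PersonId").agg(
    F.when(F.min("is_online_store"), "Online")
     .when(~F.max("is_online_store"), "Offline")
     .otherwise("Hybrid")
     .alias("New_Column")
)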