A sample of the dataset I am working with:
columns = ["cliente", "data"]
data = [(1, "2021-11-08"), (1, "2021-11-06"), (1, "2021-10-08"),
        (2, "2021-11-01"), (2, "2021-10-20"), (2, "2021-08-05"),
        (3, "2021-08-02"), (3, "2021-05-08"), (3, "2021-03-01")]
df = spark.createDataFrame(data=data, schema=columns)
I’d like to take the customer’s most recent purchase (the customer’s maximum date) and the previous one (the penultimate purchase) and compute the difference between the two (I’m assuming the same product on all purchases). I haven’t found anything in PySpark that does this directly. My idea was to do it with a groupBy, like the one below with the minimum and maximum dates.
from pyspark.sql.functions import to_date, max, min

df1 = df.withColumn('data', to_date(df.data))
dados_agrupados_item = df1.groupBy(["cliente"]).agg(max("data"), min("data"))
The output is:

+-------+----------+----------+
|cliente| max(data)| min(data)|
+-------+----------+----------+
|      1|2021-11-08|2021-10-08|
|      2|2021-11-01|2021-08-05|
|      3|2021-08-02|2021-03-01|
+-------+----------+----------+
For my problem, the output should instead be the maximum date and the penultimate purchase date for each customer, i.e.:

cliente 1: 2021-11-08 and 2021-11-06
cliente 2: 2021-11-01 and 2021-10-20
cliente 3: 2021-08-02 and 2021-05-08
Another option would be to directly return the difference between these two dates, but I have no idea how to implement that either.
Thank you very much in advance.
Answer
- Use groupBy with collect_list to collect all dates per customer into an array.
- Use reverse and array_sort to put that array in descending order, from most recent to oldest.
- Reference the first and second elements to get the last and penultimate purchases. (This assumes every customer has purchased at least twice; otherwise more complex logic would be needed.)
from pyspark.sql.functions import col, reverse, array_sort, collect_list

(
    df1.groupBy(["cliente"])
    # collect all the dates per customer into an array, sorted from most recent to oldest
    .agg(reverse(array_sort(collect_list(df1.data))).alias("dates"))
    .select(
        col("cliente"),
        col("dates")[0].alias("last_purchase"),  # most recent purchase
        col("dates")[1].alias("penultimate"),    # second most recent purchase
    )
    .show()
)

+-------+-------------+-----------+
|cliente|last_purchase|penultimate|
+-------+-------------+-----------+
|      1|   2021-11-08| 2021-11-06|
|      3|   2021-08-02| 2021-05-08|
|      2|   2021-11-01| 2021-10-20|
+-------+-------------+-----------+
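If you want the difference itself, as asked in the question, the same aggregation can be extended with datediff. This is just a sketch building on df1 above; the days_between column name is my own choice:

from pyspark.sql.functions import col, reverse, array_sort, collect_list, datediff

(
    df1.groupBy(["cliente"])
    .agg(reverse(array_sort(collect_list(df1.data))).alias("dates"))
    .select(
        col("cliente"),
        col("dates")[0].alias("last_purchase"),
        col("dates")[1].alias("penultimate"),
        # number of days between the most recent and the penultimate purchase
        datediff(col("dates")[0], col("dates")[1]).alias("days_between"),
    )
    .show()
)

With the sample data above, this should give 2, 12 and 86 days for clients 1, 2 and 3 respectively.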