
Taking the recency time of client purchase with PySpark

A sample of the dataset I am working with:

[Image: sample of the dataset]

I’d like to take each customer’s most recent purchase date (the customer’s date.max()) and the date of the purchase before it (the penultimate purchase), and compute the difference between the two (I’m assuming the same product on all purchases). I haven’t found anything in PySpark that does this yet. One idea was to do this in a groupby, like the one below that returns the minimum and maximum dates.


The output is:

[Image: output of the groupby with the minimum and maximum dates]

For my problem, the output would be the most recent purchase date and the penultimate purchase date for each customer. The output should be:

[Image: expected output with the most recent and penultimate purchase dates]

Another option would be to return the difference between these two dates directly, but I have no idea how to implement that.

Thank you very much in advance.


Answer

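Use groupBy with collect_list to collect all purchase dates per customer.

Use reverse/array_sort to put the array in descending order, so the most recent date comes first.

Reference the first and second elements of the array, which are the most recent and the penultimate purchase. (We hope every customer has purchased at least twice, or we’d need more complex logic to handle it.)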

User contributions licensed under: CC BY-SA