Original dataframe:
member_id | AccessDate |
---|---|
111111 | 2020-02-03 |
111111 | 2022-03-05 |
222222 | 2015-03-04 |
333333 | 2021-11-23 |
333333 | 2021-11-24 |
333333 | 2021-11-25 |
333333 | 2022-10-11 |
333333 | 2022-10-12 |
333333 | 2022-10-13 |
333333 | 2022-07-07 |
444444 | 2019-01-21 |
444444 | 2019-04-21 |
444444 | 2019-04-22 |
444444 | 2019-04-23 |
444444 | 2019-04-24 |
444444 | 2019-05-05 |
444444 | 2019-05-06 |
444444 | 2019-05-07 |
Result dataframe:
member_id | Most_Consecutive_AccessDate | total |
---|---|---|
111111 | 2022-03-05 | 1 |
222222 | 2015-03-04 | 1 |
333333 | 2022-10-11, 2022-10-12, 2022-10-13 | 3 |
444444 | 2019-04-21, 2019-04-22, 2019-04-23, 2019-04-24 | 4 |
Describe my question:
- Each member has a unique id and Accessdate as the first table.
- If the Member don’t have consecutive day, must get the latest date.
- If the Member have consecutive day more than 1 period, must get the most consecutive streak.
- If the Member have the same number of consecutive day, must get the latest consecutive streak.
- Example from original table member 333333 have consecutive period as 2021-11-23, 2021-11-24, 2021-11-25 and 2022-10-11, 2022-10-12, 2022-10-13. We need only the last period.
I have tried to use it.
df = df.withColumn(‘Most_Consecutive_AccessDate’, collect_list(‘AccessDate’)
try to get the number in the list But it is too far from the expected result
ps. edit for easier reading and wrong typing
Advertisement
Answer
This would work: (Using “gap-and-islands” – will add details shortly)
df.select(F.col("member_id"), F.col("AccessDate").cast("Date")) .withColumn("row_num", F.row_number().over(Window().orderBy(F.lit('A')))) .withColumn("bucket", F.col("AccessDate")-F.col("row_num")) .groupBy("bucket", "member_id") .agg(F.max("AccessDate").alias("AccessDate"), F.count("*").alias("cnt")) .orderBy(F.col("cnt").desc(), F.col("AccessDate").desc()) .groupBy("member_id") .agg(F.first("cnt"), F.first("AccessDate")) .show();
To know more and learn about series problems – https://blog.jooq.org/how-to-find-the-longest-consecutive-series-of-events-in-sql/