Is there a way to get the most recent 30 days' worth of records for each grouping of data in PySpark? In this example, get the 2 records with the most recent dates within each (Grouping, Bucket) group. So a table like this
| Grouping | Bucket | Date      |
|----------|--------|-----------|
| 1        | 1      | 9-20-2020 |
| 1        | 1      | 9-16-2020 |
| 1        | 1      | 8-14-2020 |
| 1        | 2      | 9-20-2020 |
| 1        | 2      | 9-16-2020 |
| 1        | 2      | 8-14-2020 |
| 2        | 1      | 9-20-2020 |
| 2        | 1      | 9-16-2020 |
| 2        | 1      | 8-14-2020 |
Would turn into this:
| Grouping | Bucket | Date      |
|----------|--------|-----------|
| 1        | 1      | 9-20-2020 |
| 1        | 1      | 9-16-2020 |
| 1        | 2      | 9-20-2020 |
| 1        | 2      | 9-16-2020 |
| 2        | 1      | 9-20-2020 |
| 2        | 1      | 9-16-2020 |
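For reference, a minimal sketch that builds the sample data above as a DataFrame (the column names and the `M-d-yyyy` date format are assumptions based on the tables shown):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Sample rows matching the first table; dates are parsed from M-d-yyyy strings.
rows = [
    (1, 1, "9-20-2020"), (1, 1, "9-16-2020"), (1, 1, "8-14-2020"),
    (1, 2, "9-20-2020"), (1, 2, "9-16-2020"), (1, 2, "8-14-2020"),
    (2, 1, "9-20-2020"), (2, 1, "9-16-2020"), (2, 1, "8-14-2020"),
]
df = spark.createDataFrame(rows, ["Grouping", "Bucket", "Date"]) \
    .withColumn("Date", F.to_date("Date", "M-d-yyyy"))
```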
Edit: I reviewed my question after editing it and realized that not making the edit in the first place was the right choice.
Answer
Use a window and take the top two ranks within each window:
df.withColumn("r", F.rank().over(Window.partitionBy("Grouping", "Bucket") .orderBy(F.col("Date").desc()))) .filter("r <=2") .drop("r") .show()
Output:
```
+--------+------+----------+
|Grouping|Bucket|      Date|
+--------+------+----------+
|       1|     1|2020-09-20|
|       1|     1|2020-09-16|
|       1|     2|2020-09-20|
|       1|     2|2020-09-16|
|       2|     1|2020-09-20|
|       2|     1|2020-09-16|
+--------+------+----------+
```
Edit: this answer applies to this revision (get the most recent N records for each group).
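If what is actually needed is the most recent 30 days' worth of records per group (as in the question title) rather than the N most recent rows, one possible sketch is to compare each row's date against the group's maximum date over the same window. The 30-day cutoff below is an assumption taken from the question title:

```python
from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy("Grouping", "Bucket")

# Keep rows whose Date falls within 30 days of the latest Date in the group.
df.withColumn("max_date", F.max("Date").over(w)) \
  .filter(F.datediff(F.col("max_date"), F.col("Date")) <= 30) \
  .drop("max_date") \
  .show()
```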