Is there a way to get the most recent 30 days' worth of records for each grouping of data in PySpark? In this example, get the 2 records with the most recent dates within each (Grouping, Bucket) group. So a table like this:
| Grouping | Bucket | Date      |
|----------|--------|-----------|
| 1        | 1      | 9-20-2020 |
| 1        | 1      | 9-16-2020 |
| 1        | 1      | 8-14-2020 |
| 1        | 2      | 9-20-2020 |
| 1        | 2      | 9-16-2020 |
| 1        | 2      | 8-14-2020 |
| 2        | 1      | 9-20-2020 |
| 2        | 1      | 9-16-2020 |
| 2        | 1      | 8-14-2020 |
Would turn into this:
| Grouping | Bucket | Date      |
|----------|--------|-----------|
| 1        | 1      | 9-20-2020 |
| 1        | 1      | 9-16-2020 |
| 1        | 2      | 9-20-2020 |
| 1        | 2      | 9-16-2020 |
| 2        | 1      | 9-20-2020 |
| 2        | 1      | 9-16-2020 |
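For reference, the sample data above can be reproduced with something like the following sketch (column names come from the tables; parsing the M-d-yyyy strings with to_date is an assumption about how the Date column should be typed):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Rows copied from the input table above.
rows = [
    (1, 1, "9-20-2020"), (1, 1, "9-16-2020"), (1, 1, "8-14-2020"),
    (1, 2, "9-20-2020"), (1, 2, "9-16-2020"), (1, 2, "8-14-2020"),
    (2, 1, "9-20-2020"), (2, 1, "9-16-2020"), (2, 1, "8-14-2020"),
]
df = spark.createDataFrame(rows, ["Grouping", "Bucket", "Date"])

# Assumption: convert the strings to DateType so that ordering by Date
# is chronological rather than lexicographic.
df = df.withColumn("Date", F.to_date("Date", "M-d-yyyy"))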
Edit: I reviewed my question after editing it and realized that not making the edit in the first place was the right choice.
Answer
Use a window and take the top two ranks within each window:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank rows within each (Grouping, Bucket) partition by Date, newest first,
# keep the two most recent rows per group, then drop the helper column.
w = Window.partitionBy("Grouping", "Bucket").orderBy(F.col("Date").desc())
df.withColumn("r", F.rank().over(w)) \
  .filter("r <= 2") \
  .drop("r") \
  .show()
Output:
+--------+------+----------+
|Grouping|Bucket|      Date|
+--------+------+----------+
|       1|     1|2020-09-20|
|       1|     1|2020-09-16|
|       1|     2|2020-09-20|
|       1|     2|2020-09-16|
|       2|     1|2020-09-20|
|       2|     1|2020-09-16|
+--------+------+----------+
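One caveat: rank() assigns the same rank to tied dates, so a group can keep more than two rows if its most recent dates are duplicated. If exactly two rows per group are required, row_number() could be swapped in instead; a minimal sketch of that variant:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# row_number() breaks ties arbitrarily, so each group keeps at most two rows.
w = Window.partitionBy("Grouping", "Bucket").orderBy(F.col("Date").desc())
df.withColumn("r", F.row_number().over(w)).filter("r <= 2").drop("r").show()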
Edit: this answer applies to this revision of the question (get the most recent N records for each group).
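The question's original wording also mentions the most recent 30 days' worth of records per group. If a date cutoff is wanted rather than a fixed row count, one possible approach (a sketch, assuming Date is a DateType column; the helper column latest is introduced here purely for illustration) is to compare each row against its group's maximum date:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Keep every row whose Date falls within 30 days of the newest Date
# in its (Grouping, Bucket) group.
w = Window.partitionBy("Grouping", "Bucket")
(df.withColumn("latest", F.max("Date").over(w))
   .filter(F.datediff(F.col("latest"), F.col("Date")) <= 30)
   .drop("latest")
   .show())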