Is there a way to get the most recent 30 days' worth of records for each grouping of data in PySpark? In this example, get the 2 records with the most recent dates within each (Grouping, Bucket) group. So a table like this:
| Grouping | Bucket | Date      |
|----------|--------|-----------|
| 1        | 1      | 9-20-2020 |
| 1        | 1      | 9-16-2020 |
| 1        | 1      | 8-14-2020 |
| 1        | 2      | 9-20-2020 |
| 1        | 2      | 9-16-2020 |
| 1        | 2      | 8-14-2020 |
| 2        | 1      | 9-20-2020 |
| 2        | 1      | 9-16-2020 |
| 2        | 1      | 8-14-2020 |
Would turn into this:
| Grouping | Bucket | Date      |
|----------|--------|-----------|
| 1        | 1      | 9-20-2020 |
| 1        | 1      | 9-16-2020 |
| 1        | 2      | 9-20-2020 |
| 1        | 2      | 9-16-2020 |
| 2        | 1      | 9-20-2020 |
| 2        | 1      | 9-16-2020 |
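For reference, the sample data above can be reproduced with something like the following sketch (column names come from the tables; parsing the M-d-yyyy strings with to_date is an assumption about how the Date column should be typed):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Rows copied from the input table above.
rows = [
    (1, 1, "9-20-2020"), (1, 1, "9-16-2020"), (1, 1, "8-14-2020"),
    (1, 2, "9-20-2020"), (1, 2, "9-16-2020"), (1, 2, "8-14-2020"),
    (2, 1, "9-20-2020"), (2, 1, "9-16-2020"), (2, 1, "8-14-2020"),
]
df = spark.createDataFrame(rows, ["Grouping", "Bucket", "Date"])

# Assumption: convert the strings to DateType so that ordering by Date
# is chronological rather than lexicographic.
df = df.withColumn("Date", F.to_date("Date", "M-d-yyyy"))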
Edit: I reviewed my question after editing it and realized that not making the edit in the first place was the right choice.
Answer
Use a window and take the top two ranks within each window:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank rows within each (Grouping, Bucket) partition by Date, newest first,
# keep the two most recent rows per group, then drop the helper column.
w = Window.partitionBy("Grouping", "Bucket").orderBy(F.col("Date").desc())
df.withColumn("r", F.rank().over(w)) \
  .filter("r <= 2") \
  .drop("r") \
  .show()
Output:
+--------+------+----------+
|Grouping|Bucket|      Date|
+--------+------+----------+
|       1|     1|2020-09-20|
|       1|     1|2020-09-16|
|       1|     2|2020-09-20|
|       1|     2|2020-09-16|
|       2|     1|2020-09-20|
|       2|     1|2020-09-16|
+--------+------+----------+
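One caveat: rank() assigns the same rank to tied dates, so a group can keep more than two rows if its most recent dates are duplicated. If exactly two rows per group are required, row_number() could be swapped in instead; a minimal sketch of that variant:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# row_number() breaks ties arbitrarily, so each group keeps at most two rows.
w = Window.partitionBy("Grouping", "Bucket").orderBy(F.col("Date").desc())
df.withColumn("r", F.row_number().over(w)).filter("r <= 2").drop("r").show()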
Edit: this answer applies to this revision of the question (get the most recent N records for each group).
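The question's original wording also mentions the most recent 30 days' worth of records per group. If a date cutoff is wanted rather than a fixed row count, one possible approach (a sketch, assuming Date is a DateType column; the helper column latest is introduced here purely for illustration) is to compare each row against its group's maximum date:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Keep every row whose Date falls within 30 days of the newest Date
# in its (Grouping, Bucket) group.
w = Window.partitionBy("Grouping", "Bucket")
(df.withColumn("latest", F.max("Date").over(w))
   .filter(F.datediff(F.col("latest"), F.col("Date")) <= 30)
   .drop("latest")
   .show())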