How to get the N most recent dates in PySpark

Is there a way to get the most recent 30 days' worth of records for each grouping of data in PySpark? In this example, get the 2 records with the most recent dates within each (Grouping, Bucket) group. So a table like this

JavaScript

Would turn into this:

JavaScript

Edit: I reviewed my question after editing it and realized that leaving it unedited would have been the right choice.

Answer

Use a window and take the top two ranks within each window:

JavaScript

Output:

JavaScript

Edit: this answer applies to the revision of the question that asks for the most recent N records for each group.

User contributions licensed under: CC BY-SA