PySpark: get top two values in a column per group, based on ordering

I am trying to get the first two counts that appear in this list, along with the earliest log_date on which each appeared.

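As a working assumption about the input's shape (the exact rows are hypothetical), picture per-state daily rows of (state, log_date, count), where the count can dip and later return to an earlier value:

```python
# Hypothetical sample rows (state, log_date, count). The shape matches
# the question: one running count per state per day, which can go down
# and then come back up to a value it already had.
rows = [
    ("CA", "2021-07-08", 7398),
    ("CA", "2021-07-09", 7402),  # 7402 first appears here
    ("CA", "2021-07-10", 7399),  # count goes down...
    ("CA", "2021-07-11", 7402),  # ...and back up to 7402
    ("CA", "2021-07-12", 7410),
]
```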

In this case, my expected output is the two most recent counts, each with the earliest log_date on which it appeared.

This is what I have working, but there are a few edge cases where the count can go down and then back up, as shown in the example above. In that case my code returns 2021-07-11 as the earliest date for count=7402.

I think what I need to do is select the first two counts by sorting on state and log_date (descending), then get the min log_date for each count. I thought rank() might work here, by taking the highest rank for each count, but I am stumped on how to apply it in this situation. No matter what I try, I haven't been able to get rid of the last two count=7402 records. Maybe there is an easier way that I am overlooking?

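The logic being described can be pictured before reaching for Spark: collapse each distinct count to its first and last appearance, rank the counts by most recent appearance, keep the top two, and report each one's earliest date. A plain-Python sketch with hypothetical data shaped like the question's:

```python
from collections import defaultdict

# Hypothetical rows of (state, log_date, count); 7402 dips and returns.
rows = [
    ("CA", "2021-07-08", 7398),
    ("CA", "2021-07-09", 7402),
    ("CA", "2021-07-10", 7399),
    ("CA", "2021-07-11", 7402),  # 7402 reappears after the dip
    ("CA", "2021-07-12", 7410),
]

# Collapse each (state, count) to its first and last log_date.
spans = {}  # (state, count) -> [first_seen, last_seen]
for state, log_date, count in rows:
    key = (state, count)
    if key not in spans:
        spans[key] = [log_date, log_date]
    else:
        spans[key][0] = min(spans[key][0], log_date)
        spans[key][1] = max(spans[key][1], log_date)

# Group the collapsed counts by state, keyed for sorting by last_seen.
per_state = defaultdict(list)
for (state, count), (first_seen, last_seen) in spans.items():
    per_state[state].append((last_seen, count, first_seen))

# Per state: sort by most recent appearance, keep the top two, and
# report each count with its earliest date.
result = {
    state: [(count, first_seen)
            for _, count, first_seen in sorted(entries, reverse=True)[:2]]
    for state, entries in per_state.items()
}
print(result)  # {'CA': [(7410, '2021-07-12'), (7402, '2021-07-09')]}
```

Because 7402 is collapsed to a single span before ranking, its dip-and-return no longer produces duplicate records, and its reported date is the true earliest, 2021-07-09.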


Answer

Your intuition was quite correct; here is a possible implementation:

User contributions licensed under: CC BY-SA