I am trying to get the first two counts that appear in this list, along with the earliest log_date on which each of them appeared.
state    count    log_date
GU       7402     2021-07-19
GU       7402     2021-07-18
GU       7402     2021-07-17
GU       7402     2021-07-16
GU       7397     2021-07-15
GU       7397     2021-07-14
GU       7397     2021-07-13
GU       7402     2021-07-12
GU       7402     2021-07-11
GU       7225     2021-07-10
GU       7225     2021-07-10
In this case my expected output is:
GU       7402     2021-07-16
GU       7397     2021-07-13
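For reference, the sample data can be rebuilt with the sketch below; it assumes an active SparkSession named spark, and the column names and values are copied from the table above.

import pyspark.sql.functions as F

data = [
    ("GU", 7402, "2021-07-19"), ("GU", 7402, "2021-07-18"),
    ("GU", 7402, "2021-07-17"), ("GU", 7402, "2021-07-16"),
    ("GU", 7397, "2021-07-15"), ("GU", 7397, "2021-07-14"),
    ("GU", 7397, "2021-07-13"), ("GU", 7402, "2021-07-12"),
    ("GU", 7402, "2021-07-11"), ("GU", 7225, "2021-07-10"),
    ("GU", 7225, "2021-07-10"),
]
# ISO date strings already sort correctly as strings; to_date just makes the type explicit
df = spark.createDataFrame(data, ["state", "count", "log_date"]) \
    .withColumn("log_date", F.to_date("log_date"))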
This is what I have working, but there are a few edge cases where the count can go down and then come back up, as in the example above. Because the groupBy keys on the count value alone, this code returns 2021-07-11 as the earliest date for count=7402.
df = df.withColumnRenamed("count", "case_count")

df2 = df.groupBy("state", "case_count").agg(
    F.min("log_date").alias("earliest_date")
)
df2 = df2.select("state", "case_count", "earliest_date").distinct()

df = df2.withColumn(
    "last_import_date",
    F.max("earliest_date").over(Window.partitionBy("state"))
).withColumn(
    "max_count",
    F.min(
        F.when(
            F.col("earliest_date") == F.col("last_import_date"),
            F.col("case_count")
        )
    ).over(Window.partitionBy("state"))
)
df = df.select("state", "max_count", "last_import_date").distinct()
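To show where this goes wrong, here is just the first aggregation step run against the sample DataFrame above (before the rest of the code): both runs of 7402 fall into the same group, so min() can only ever return 2021-07-11. The output below is what I expect for the sample data, although show() does not guarantee row order.

df.withColumnRenamed("count", "case_count") \
    .groupBy("state", "case_count") \
    .agg(F.min("log_date").alias("earliest_date")) \
    .show()

# +-----+----------+-------------+
# |state|case_count|earliest_date|
# +-----+----------+-------------+
# |   GU|      7402|   2021-07-11|
# |   GU|      7397|   2021-07-13|
# |   GU|      7225|   2021-07-10|
# +-----+----------+-------------+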
I think what I need to do is select the first two counts based on sorting by state and log_date (desc), then get the min log_date for each count. I thought rank() might work here by taking the highest rank for each count, but I am stumped on how to apply it in this situation: no matter what I try, I haven't been able to get rid of the last two count=7402 records. Maybe there is an easier way that I am overlooking?
df = df.withColumnRenamed("count", "case_count")

df = df.withColumn(
    "rank",
    F.rank().over(
        Window.partitionBy("state", "case_count")
              .orderBy(F.col("state").asc(), F.col("log_date").desc())
    )
).orderBy(
    F.col("log_date").desc(), F.col("state").asc(), F.col("rank").desc()
)

# output
state    count    log_date      rank
GU       7402     2021-07-19    1
GU       7402     2021-07-18    2
GU       7402     2021-07-17    3
GU       7402     2021-07-16    4
GU       7397     2021-07-15    1
GU       7397     2021-07-14    2
GU       7397     2021-07-13    3
GU       7402     2021-07-12    5
GU       7402     2021-07-11    6
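To make the problem concrete outside of Spark, here is a small pure-Python sketch of the same count sequence (the names counts and runs are just for illustration). It shows that what I am really after are consecutive runs, which neither a groupBy nor a rank window keyed on the count value can see.

from itertools import groupby

# counts in descending log_date order, copied from the sample data above
counts = [7402, 7402, 7402, 7402, 7397, 7397, 7397, 7402, 7402, 7225, 7225]

# itertools.groupby only groups *consecutive* equal values, so the later
# 7402 rows form their own run instead of being merged with the first one
runs = [(value, len(list(group))) for value, group in groupby(counts)]
print(runs)
# [(7402, 4), (7397, 3), (7402, 2), (7225, 2)]
# I need the earliest log_date of the first two runs only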
Answer
Your intuition was quite correct; here is a possible implementation:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# define some windows for later
w_date = Window.partitionBy('state').orderBy(F.desc('log_date'))
w_rn = Window.partitionBy('state').orderBy('rn')
w_grp = Window.partitionBy('state', 'grp')

df = (
    df
    .withColumn('rn', F.row_number().over(w_date))
    .withColumn('changed', (F.col('count') != F.lag('count', 1, 0).over(w_rn)).cast('int'))
    .withColumn('grp', F.sum('changed').over(w_rn))
    .filter(F.col('grp') <= 2)
    .withColumn('min_date', F.col('log_date') == F.min('log_date').over(w_grp))
    .filter(F.col('min_date') == True)
    .drop('rn', 'changed', 'grp', 'min_date')
)

df.show()

+-----+-----+----------+
|state|count|  log_date|
+-----+-----+----------+
|   GU| 7402|2021-07-16|
|   GU| 7397|2021-07-13|
+-----+-----+----------+
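This is the usual gaps-and-islands pattern: row_number fixes a deterministic ordering, lag flags every row where the count changes, and the running sum of those flags labels each consecutive run. If you later need the first N runs instead of two, only the filter changes. Here is a sketch reusing the windows defined above; N and result are names I made up, df here means the original input DataFrame rather than the two-row result just computed, and the final groupBy is simply an alternative way to pick the earliest date of each run.

N = 2  # number of most recent runs to keep

result = (
    df  # the original input DataFrame, not the result computed above
    .withColumn('rn', F.row_number().over(w_date))
    .withColumn('changed', (F.col('count') != F.lag('count', 1, 0).over(w_rn)).cast('int'))
    .withColumn('grp', F.sum('changed').over(w_rn))
    .filter(F.col('grp') <= N)
    # earliest log_date within each consecutive run of the same count
    .groupBy('state', 'grp', 'count')
    .agg(F.min('log_date').alias('log_date'))
    .orderBy('grp')
    .drop('grp')
)
result.show()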