I have a dataframe similar to below.
from datetime import date

rdd = sc.parallelize([
    [123, date(2007, 1, 31), 1],
    [123, date(2007, 2, 28), 1],
    [123, date(2007, 3, 31), 1],
    [123, date(2007, 4, 30), 1],
    [123, date(2007, 5, 31), 1],
    [123, date(2007, 6, 30), 1],
    [123, date(2007, 7, 31), 1],
    [123, date(2007, 8, 31), 1],
    [123, date(2007, 8, 31), 2],
    [123, date(2007, 9, 30), 1],
    [123, date(2007, 9, 30), 2],
    [123, date(2007, 10, 31), 1],
    [123, date(2007, 10, 31), 2],
    [123, date(2007, 11, 30), 1],
    [123, date(2007, 11, 30), 2],
    [123, date(2007, 12, 31), 1],
    [123, date(2007, 12, 31), 2],
    [123, date(2007, 12, 31), 3],
    [123, date(2008, 1, 31), 1],
    [123, date(2008, 1, 31), 2],
    [123, date(2008, 1, 31), 3]
])
df = rdd.toDF(['id', 'sale_date', 'sale'])
df.show()
From the above dataframe, I would like to keep only the most recent sale for each date, so that each date appears in exactly one row. For the example above, the output would look like:
rdd_out = sc.parallelize([
    [123, date(2007, 1, 31), 1],
    [123, date(2007, 2, 28), 1],
    [123, date(2007, 3, 31), 1],
    [123, date(2007, 4, 30), 1],
    [123, date(2007, 5, 31), 1],
    [123, date(2007, 6, 30), 1],
    [123, date(2007, 7, 31), 1],
    [123, date(2007, 8, 31), 2],
    [123, date(2007, 9, 30), 2],
    [123, date(2007, 10, 31), 2],
    [123, date(2007, 11, 30), 2],
    [123, date(2007, 12, 31), 2],
    [123, date(2008, 1, 31), 3]
])
df_out = rdd_out.toDF(['id', 'sale_date', 'sale'])
df_out.show()
Can you please guide me on how I can get to this result?
As an FYI, using SAS I would have achieved this result as follows:
proc sort data = df;
    by id date sale;
run;

data want;
    set df;
    by id date sale;
    if last.date;
run;
Answer
There are probably many ways to achieve this, but one way is to use a Window. With a Window you can partition your data by one or more columns (in your case sale_date) and, on top of that, order the data within each partition by a specific column (in your case descending on sale, so that the latest sale comes first). So:
from pyspark.sql.window import Window
from pyspark.sql.functions import desc

my_window = Window.partitionBy("sale_date").orderBy(desc("sale"))
You can then apply this Window to your DataFrame together with one of the many window functions. One of them is row_number, which, within each partition, assigns a row number to each row based on your orderBy. Like this:
from pyspark.sql.functions import row_number

df_out = df.withColumn("row_number", row_number().over(my_window))
As a result, the last sale for each date gets row_number = 1. If you then filter on row_number = 1, you keep only the last sale for each group.
So, the full code:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, desc, col

my_window = Window.partitionBy("sale_date").orderBy(desc("sale"))

df_out = (
    df
    .withColumn("row_number", row_number().over(my_window))
    .filter(col("row_number") == 1)
    .drop("row_number")
)
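To check that this matches the df_out you expect, you can sort the result before displaying it (a minimal sketch, assuming the df built from your question and the df_out produced by the code above; window output order is not guaranteed, so the sort is only for readability):

# Sanity check: sort by id and sale_date and display the deduplicated rows.
(
    df_out
    .orderBy("id", "sale_date")
    .show()
)

One design note: this window partitions only by sale_date, which is enough for your example since there is a single id. If your real data has multiple id values, you would likely partition by both id and sale_date to mirror the SAS "by id date" grouping.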