collect_list by preserving order based on another variable

I am trying to create a new column of lists in PySpark using a groupby aggregation on an existing set of columns. An example input data frame is provided below:

JavaScript

The expected output is:

JavaScript

The values within a list are sorted by the date.

I tried using collect_list as follows:

JavaScript

But collect_list doesn’t guarantee order even if I sort the input data frame by date before aggregation.

Could someone help with how to aggregate while preserving the order based on a second (date) variable?

Answer

If you collect both the dates and the values into a single list, you can sort the resulting column by date using a UDF and then keep only the values in the result.
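
A minimal sketch of that idea follows. It is not the original answer's code: the column names `id`, `date`, and `value`, the integer type of `value`, and both function names are assumptions made for illustration. The sorting logic is kept in a plain Python function so it can be checked without a Spark session.

```python
from operator import itemgetter


def values_sorted_by_date(pairs):
    """Return the values from (date, value) pairs, ordered by date."""
    # Spark passes an array of structs to a Python UDF as a list of Row
    # objects, which behave like tuples, so plain tuples work here too.
    return [value for _, value in sorted(pairs, key=itemgetter(0))]


def collect_values_in_date_order(df):
    """Group by "id" and collect "value" ordered by "date".

    The column names "id", "date", and "value" are hypothetical.
    """
    # Imported here so the pure-Python sorter above can be used without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, IntegerType

    # Assumes "value" is an integer column; change the element type otherwise.
    sorter = F.udf(values_sorted_by_date, ArrayType(IntegerType()))

    return (
        df.groupBy("id")
        # Collect date and value together so each value keeps its date.
        .agg(F.collect_list(F.struct("date", "value")).alias("pairs"))
        # Sort by date inside the UDF and keep only the values.
        .select("id", sorter("pairs").alias("sorted_values"))
    )
```

Because the date travels with each value inside the struct, the order in which `collect_list` gathers the rows no longer matters. The sort happens afterwards, within each group's list.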

User contributions licensed under: CC BY-SA