collect_list by preserving order based on another variable

Question

I am trying to create a new column of lists in PySpark using a groupby aggregation on an existing set of columns. An example input data frame is provided below:

The expected output is:

The values within each list are sorted by the date. I tried using collect_list as follows:

But collect_list doesn't guarantee order even if I sort the input

Accepted Answer

If you collect both dates and values as a list, you can sort the resulting column by date using a udf, and then keep only the values in the result.

    import operator
    import pyspark.sql.functions as F

    # create list column of (date, value) structs
    grouped_df = (
        input_df.groupby("id")
        .agg(F.collect_list(F.struct("date", "value")).alias("list_col"))
    )

    # define udf: sort the structs by date (field 0), keep only the value (field 1)
    def sorter(l):
        res = sorted(l, key=operator.itemgetter(0))
        return [item[1] for item in res]

    # no returnType is given, so the result column is a string
    sort_udf = F.udf(sorter)

    # test
    (
        grouped_df.select("id", sort_udf("list_col").alias("sorted_list"))
        .show(truncate=False)
    )

    +---+----------------+
    |id |sorted_list     |
    +---+----------------+
    |1  |[10, 5, 15, 20] |
    |2  |[100, 500, 1500]|
    +---+----------------+
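To see what the sorting step does on its own, here is a minimal sketch in plain Python, without Spark. The tuples stand in for the (date, value) structs that collect_list gathers for a single id. The dates are hypothetical, invented only for illustration, and chosen so that the result matches the first row of the output above.

```python
import operator
from datetime import date


def sorter(pairs):
    """Sort (date, value) pairs by date and return only the values."""
    ordered = sorted(pairs, key=operator.itemgetter(0))
    return [value for _, value in ordered]


# Hypothetical collected list for one id, deliberately out of date order
collected = [
    (date(2017, 1, 3), 15),
    (date(2017, 1, 1), 10),
    (date(2017, 1, 4), 20),
    (date(2017, 1, 2), 5),
]

sorted_values = sorter(collected)
print(sorted_values)  # [10, 5, 15, 20]
```

Because the date travels with each value inside the pair, the order in which the pairs were collected no longer matters. The sort restores it afterwards.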

Answer