Given the following input dataframe:
npos = 3
inp = spark.createDataFrame([
['1', 23, 0, 2],
['1', 45, 1, 2],
['1', 89, 1, 3],
['1', 95, 2, 2],
['1', 95, 0, 4],
['2', 20, 2, 2],
['2', 40, 1, 4],
], schema=["id","elap","pos","lbl"])
From it, a dataframe that looks like this needs to be constructed:
out = spark.createDataFrame([
['1', 23, [2,0,0]],
['1', 45, [2,2,0]],
['1', 89, [2,3,0]],
['1', 95, [4,3,2]],
['2', 20, [0,0,2]],
['2', 40, [0,4,2]],
], schema=["id","elap","vec"])
The input dataframe has 10s of millions of records.
Some details which are seen in the example above (by design):
- npos is the size of the vector to be constructed in the output
- pos is guaranteed to be in [0, npos)
- at each time step (elap) there will be at most 1 label for a pos
- if lbl is not given at a time step, it has to be inferred from the last time it was specified for that pos
- if lbl was not previously specified, it can be assumed to be 0
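To make the carry-forward rule concrete, here is a minimal plain-Python sketch of the intended semantics for id '1' (illustrative only, not part of the required Spark solution):

from itertools import groupby

npos = 3
rows = [(23, 0, 2), (45, 1, 2), (89, 1, 3), (95, 2, 2), (95, 0, 4)]  # (elap, pos, lbl) for id '1'

state = [0] * npos                      # every position starts at 0
for elap, grp in groupby(rows, key=lambda r: r[0]):
    for _, pos, lbl in grp:             # apply all labels given at this time step
        state[pos] = lbl
    print(elap, state)
# 23 [2, 0, 0]
# 45 [2, 2, 0]
# 89 [2, 3, 0]
# 95 [4, 3, 2]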
Answer
You can use some higher-order functions on arrays to achieve that:
- add a vec column using the array_repeat function, initializing the value at index pos from lbl
- use collect_list to get the cumulative list of vec arrays over a window partitioned by id and ordered by elap
- aggregate the resulting list of arrays, taking for each position the new value when it is different from 0 and otherwise keeping the previous one
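For intuition, the last step is the same left fold one would write in plain Python (a sketch with made-up sample values; the actual Spark code follows):

from functools import reduce

collected = [[2, 0, 0], [0, 2, 0], [0, 3, 0]]  # what collect_list yields for id '1' at elap 89
vec = reduce(
    lambda acc, x: [xi if xi != 0 else ai for ai, xi in zip(acc, x)],
    collected,
    [0, 0, 0],
)
print(vec)  # [2, 3, 0]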
from pyspark.sql import Window
import pyspark.sql.functions as F

npos = 3

out = inp.withColumn(
    "vec",
    # step 1: vector of npos zeros with lbl placed at index pos
    F.expr(f"transform(array_repeat(0, {npos}), (x, i) -> IF(i = pos, lbl, x))")
).withColumn(
    "vec",
    # step 2: cumulative list of those vectors per id, ordered by elap
    F.collect_list("vec").over(Window.partitionBy("id").orderBy("elap"))
).withColumn(
    "vec",
    # step 3: fold the list, keeping the latest non-zero value for each position
    F.expr(f"""aggregate(
        vec,
        array_repeat(0, {npos}),
        (acc, x) -> transform(acc, (y, i) -> int(IF(x[i] != 0, x[i], y)))
    )""")
).drop("lbl", "pos")
out.show(truncate=False)
#+---+----+---------+
#|id |elap|vec |
#+---+----+---------+
#|1 |23 |[2, 0, 0]|
#|1 |45 |[2, 2, 0]|
#|1 |89 |[2, 3, 0]|
#|1 |95 |[4, 3, 2]|
#|1 |95 |[4, 3, 2]|
#|2 |20 |[0, 0, 2]|
#|2 |40 |[0, 4, 2]|
#+---+----+---------+
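Note that the two input rows at id 1, elap 95 end up with the same vector because the default window frame (RANGE up to the current row) includes peer rows with the same elap. If only one row per (id, elap) should be kept, as in the expected output above, one option (my assumption, not part of the original answer) is to drop duplicates afterwards:

out = out.dropDuplicates(["id", "elap"])  # safe here since duplicate (id, elap) rows carry identical vec values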