
Summarizing labels at time steps based on current and past info

Given the following input dataframe

JavaScript

A dataframe which looks like this needs to be constructed

JavaScript

The input dataframe has tens of millions of records.

Some details, visible in the example above, hold by design:

  • npos is the size of the vector to be constructed in the output
  • pos is guaranteed to be in [0,npos)
  • at each time step (elap) there is at most one label per pos
  • if lbl is not given for a pos at a time step, it has to be inferred from the most recent earlier time step at which it was specified for that pos
  • if lbl has never been specified for a pos, it can be assumed to be 0
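
To make the rules above concrete, here is a minimal plain-Python reference (not Spark code) that could be used to check results on a small sample. The function name expected_vectors, the tuple layout (id, elap, pos, lbl, npos) and the sample rows are hypothetical; the sketch also assumes that npos is constant within an id.

```python
from collections import defaultdict


def expected_vectors(rows):
    """Map (id, elap) to the label vector implied by the rules above.

    rows: iterable of (id, elap, pos, lbl, npos) tuples (hypothetical layout).
    Assumes npos is the same for every row of a given id.
    """
    by_id = defaultdict(list)
    for id_, elap, pos, lbl, npos in rows:
        by_id[id_].append((elap, pos, lbl, npos))

    result = {}
    for id_, events in by_id.items():
        events.sort()                  # process time steps in order
        state = [0] * events[0][3]     # positions never specified default to 0
        for i, (elap, pos, lbl, _npos) in enumerate(events):
            state[pos] = lbl           # newest label for this pos wins
            # Emit one vector per time step, after its last event.
            if i + 1 == len(events) or events[i + 1][0] != elap:
                result[(id_, elap)] = list(state)
    return result


# Made-up sample rows, for illustration only.
rows = [
    ("a", 1, 0, 5, 3),
    ("a", 2, 2, 7, 3),
    ("a", 2, 1, 4, 3),
    ("a", 3, 0, 9, 3),
    ("b", 1, 1, 2, 2),
]
vectors = expected_vectors(rows)
```

With these sample rows, id "a" at elap 2 carries the label at pos 0 forward from elap 1 and adds the two labels given at elap 2.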


Answer

You can use some higher-order functions on arrays to achieve that:

  1. Add a vec column with the array_repeat function (an array of npos zeros) and set the element at index pos to lbl.
  2. Use collect_list over a window partitioned by id to gather the cumulative list of vec arrays.
  3. Aggregate the collected arrays into a single vector, keeping for each position the most recent value that is different from 0.
JavaScript
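
The three steps can be modeled in plain Python to show the data flow. This is a sketch of the logic only, not Spark code and not the answer's original snippet: one_hot stands in for the array_repeat step, the growing collected list for collect_list over the window, and reduce with merge for the aggregate step. The function names, the tuple layout (id, elap, pos, lbl, npos) and the sample rows are hypothetical, and ordering the window by elap is an assumption.

```python
from functools import reduce
from itertools import groupby


def one_hot(npos, pos, lbl):
    # Step 1: vector of npos zeros with lbl placed at index pos.
    vec = [0] * npos
    vec[pos] = lbl
    return vec


def merge(acc, vec):
    # Step 3 rule: take the newer value where it is non-zero, else keep the old one.
    return [new if new != 0 else old for old, new in zip(acc, vec)]


def summarize(rows):
    """rows: (id, elap, pos, lbl, npos) tuples; returns [(id, elap, vector), ...]."""
    rows = sorted(rows, key=lambda r: (r[0], r[1]))
    out = []
    for id_, group in groupby(rows, key=lambda r: r[0]):
        collected = []  # Step 2: cumulative list of vectors for this id
        for elap, step in groupby(group, key=lambda r: r[1]):
            step = list(step)
            npos = step[0][4]
            collected.extend(one_hot(n, p, l) for _, _, p, l, n in step)
            # Fold everything collected so far into one vector.
            out.append((id_, elap, reduce(merge, collected, [0] * npos)))
    return out


# Made-up sample rows, for illustration only.
rows = [
    ("a", 1, 0, 5, 3),
    ("a", 2, 2, 7, 3),
    ("a", 2, 1, 4, 3),
    ("a", 3, 0, 9, 3),
    ("b", 1, 1, 2, 2),
]
summary = summarize(rows)
```

Two properties of this merge rule are worth keeping in mind: a label of 0 cannot overwrite an earlier non-zero label, and the collected list grows with every time step of an id, which may matter for long histories.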
User contributions licensed under: CC BY-SA