PySpark: create sliding windows from rows with padding

I’m trying to collect groups of rows into sliding windows represented as vectors.

Given the example input:


An expected output would be:


My latest attempt produces tumbling windows without padding. Here’s my code:


I looked for variations of this, such as running a SQL query like in this case or using a built-in SQL window frame clause such as ROWS N PRECEDING, but I didn’t manage to get what I want. Most results on the web focus on temporal sliding windows, whereas I’m trying to slide over rows instead.

Any help would be greatly appreciated.

EDIT:
I think I found a solution for the padding thanks to this answer.

I still need to organize the rows in sliding windows though…


Answer

One possible solution (not the most elegant one, but still functional) is the following.
In the window definition, it uses .rowsBetween to create a sliding window of the specified size; 0 indicates the current row.


I suggest going through the solution step by step, one line of code at a time, to understand the logic behind it.
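
To trace the frame arithmetic by hand without a Spark session, the same idea can be mirrored in plain Python. This is only an illustrative sketch: the helper name `sliding_windows`, the sample values, and the zero padding are assumptions, and it imitates a frame that spans the preceding `size - 1` rows up to the current row (offset 0).

```python
def sliding_windows(values, size, pad=0.0):
    """Imitate a row frame from -(size - 1) to 0 over an ordered list."""
    out = []
    for i in range(len(values)):
        # The frame starts (size - 1) rows back, clipped at the first row.
        start = max(0, i - (size - 1))
        frame = values[start:i + 1]
        # Left-pad frames near the start so every window has the same length.
        out.append([pad] * (size - len(frame)) + frame)
    return out


windows = sliding_windows([10.0, 20.0, 30.0, 40.0], size=3)
for window in windows:
    print(window)
```

Stepping through the loop for `i = 0, 1, 2, ...` shows how the frame grows until it reaches the full size and then slides forward one row at a time.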

User contributions licensed under: CC BY-SA