In pandas, if we have a time series and need to group it by a certain frequency (say, every two weeks), it’s possible to use the Grouper
class, like this:
import pandas as pd
df.groupby(pd.Grouper(key='timestamp', freq='2W'))
Is there any equivalent in Spark (more specifically, using Scala) for this feature?
Answer
You can use the SQL function window. First, create the timestamp column, if you don't have one yet, from a string-typed datetime:
val data =
  Seq(("2022-01-01 00:00:00", 1),
      ("2022-01-01 00:15:00", 1),
      ("2022-01-08 23:30:00", 1),
      ("2022-01-22 23:30:00", 4))
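The snippet above only defines the raw data; the groupBy below assumes a DataFrame df whose string column has already been converted to a timestamp. A minimal sketch of that step, assuming a SparkSession named spark is in scope and using the column names date and a inferred from the later groupBy/sum calls:

import org.apache.spark.sql.functions.{col, sum, to_timestamp, window}

// assumes a SparkSession named `spark` is already in scope
import spark.implicits._

// column names "date" and "a" are inferred from the snippets below;
// to_timestamp parses the "yyyy-MM-dd HH:mm:ss" strings with its default format
val df = data.toDF("date", "a")
  .withColumn("date", to_timestamp(col("date")))

The window and sum functions imported here are the ones used in the next snippets.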
Then, apply the window function to the timestamp column and aggregate the column you need, to obtain one result per time slot:
val df0 =
  df.groupBy(window(col("date"), "1 week", "1 week", "0 minutes"))
    .agg(sum("a") as "sum_a")
The result includes the calculated windows. Take a look at the docs for a better understanding of the input parameters: https://spark.apache.org/docs/latest/api/sql/index.html#window.
val df1 = df0.select("window.start", "window.end", "sum_a")

df1.show()
It gives:
+-------------------+-------------------+-----+
|              start|                end|sum_a|
+-------------------+-------------------+-----+
|2022-01-20 01:00:00|2022-01-27 01:00:00|    4|
|2021-12-30 01:00:00|2022-01-06 01:00:00|    2|
|2022-01-06 01:00:00|2022-01-13 01:00:00|    1|
+-------------------+-------------------+-----+
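The question asked for the equivalent of pandas' freq='2W', i.e. two-week buckets; window also accepts longer durations such as "2 weeks". A sketch of that variant, using the same assumed column names (dfBiweekly is just an illustrative name; note that Spark anchors its windows to the Unix epoch plus the startTime offset, so the bucket boundaries will not necessarily match pandas' week anchoring):

// two-week tumbling windows, roughly matching pandas' freq='2W'
val dfBiweekly =
  df.groupBy(window(col("date"), "2 weeks", "2 weeks", "0 minutes"))
    .agg(sum("a") as "sum_a")
    .select("window.start", "window.end", "sum_a")

dfBiweekly.show()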