Is there a Scala Spark equivalent to pandas Grouper freq feature?

In pandas, if we have a time series and need to group it by a certain frequency (say, every two weeks), it’s possible to use the Grouper class, like this:

import pandas as pd
df.groupby(pd.Grouper(key='timestamp', freq='2W'))

Is there any equivalent in Spark (more specifically, using Scala) for this feature?

Answer

You can use the SQL function window. First, create the timestamp column, if you don't have one yet, from a string-typed datetime:

import org.apache.spark.sql.functions._
import spark.implicits._  // for toDF; assumes a SparkSession named spark is in scope

val data =
  Seq(("2022-01-01 00:00:00", 1),
      ("2022-01-01 00:15:00", 1),
      ("2022-01-08 23:30:00", 1),
      ("2022-01-22 23:30:00", 4))

val df = data.toDF("date", "a").withColumn("date", to_timestamp(col("date")))

Then, apply the window function to the timestamp column and aggregate the column you need, to obtain one result per slot:

// Tumbling 1-week windows: window(timeColumn, windowDuration, slideDuration, startTime)
val df0 =
  df.groupBy(window(col("date"), "1 week", "1 week", "0 minutes"))
    .agg(sum("a") as "sum_a")

The result includes the calculated window struct; you can select its start and end together with the aggregated column. Take a look at the docs for a better understanding of the input parameters: https://spark.apache.org/docs/latest/api/sql/index.html#window

val df1 = df0.select("window.start", "window.end", "sum_a")

df1.show()

It gives:

+-------------------+-------------------+-----+
|              start|                end|sum_a|
+-------------------+-------------------+-----+
|2022-01-20 01:00:00|2022-01-27 01:00:00|    4|
|2021-12-30 01:00:00|2022-01-06 01:00:00|    2|
|2022-01-06 01:00:00|2022-01-13 01:00:00|    1|
+-------------------+-------------------+-----+
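
If you want the biweekly slots from the question (pandas freq='2W'), the window duration should just widen to two weeks. A minimal sketch, reusing the df from above; note that Spark's slot boundaries may not line up with pandas' Sunday-anchored 'W' periods, so adjust the startTime offset (the fourth parameter above) if you need a particular origin:

// Two-week tumbling windows; boundary alignment may differ from pandas '2W'
val biweekly =
  df.groupBy(window(col("date"), "2 weeks"))
    .agg(sum("a") as "sum_a")
    .select("window.start", "window.end", "sum_a")

biweekly.show()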