I have a pyspark dataframe as below:
import pandas as pd
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local")
         .getOrCreate())
spark.conf.set("spark.sql.session.timeZone", "UTC")

INPUT = {
    "idx": [1, 1, 1, 1, 0],
    "consumption": [10.0, 20.0, 30.0, 40.0, 5.0],
    "valid_from": [
        pd.Timestamp("2019-01-01 00:00:00+00:00", tz="UTC"),
        pd.Timestamp("2019-01-02 00:00:00+00:00", tz="UTC"),
        pd.Timestamp("2019-01-03 00:00:00+00:00", tz="UTC"),
        pd.Timestamp("2019-01-06 00:00:00+00:00", tz="UTC"),
        pd.Timestamp("2019-01-01 00:00:00+00:00", tz="UTC"),
    ],
    "valid_to": [
        pd.Timestamp("2019-01-02 00:00:00+0000", tz="UTC"),
        pd.Timestamp("2019-01-05 00:00:00+0000", tz="UTC"),
        pd.Timestamp("2019-01-05 00:00:00+0000", tz="UTC"),
        pd.Timestamp("2019-01-08 00:00:00+0000", tz="UTC"),
        pd.Timestamp("2019-01-02 00:00:00+00:00", tz="UTC"),
    ],
}

df = pd.DataFrame.from_dict(INPUT)
spark.createDataFrame(df).show()

+---+-----------+-------------------+-------------------+
|idx|consumption|         valid_from|           valid_to|
+---+-----------+-------------------+-------------------+
|  1|       10.0|2019-01-01 00:00:00|2019-01-02 00:00:00|
|  1|       20.0|2019-01-02 00:00:00|2019-01-05 00:00:00|
|  1|       30.0|2019-01-03 00:00:00|2019-01-05 00:00:00|
|  1|       40.0|2019-01-06 00:00:00|2019-01-08 00:00:00|
|  0|        5.0|2019-01-01 00:00:00|2019-01-02 00:00:00|
+---+-----------+-------------------+-------------------+
And I want to sum the consumption over the overlapping interval slices, per idx:
+---+-------------------+-----------+
|idx|          timestamp|consumption|
+---+-------------------+-----------+
|  1|2019-01-01 00:00:00|       10.0|
|  1|2019-01-02 00:00:00|       20.0|
|  1|2019-01-03 00:00:00|       50.0|
|  1|2019-01-04 00:00:00|       50.0|
|  1|2019-01-05 00:00:00|        0.0|
|  1|2019-01-06 00:00:00|       40.0|
|  1|2019-01-07 00:00:00|       40.0|
|  1|2019-01-08 00:00:00|        0.0|
|  0|2019-01-01 00:00:00|        5.0|
|  0|2019-01-02 00:00:00|        0.0|
+---+-------------------+-----------+
Answer
You can use sequence to expand each interval into single days, explode the list of days, and then sum the consumption for each timestamp and idx:
from pyspark.sql import functions as F

input = spark.createDataFrame(df)

(input
 # expand each interval into one row per covered day (valid_to is treated as exclusive)
 .withColumn("all_days", F.sequence("valid_from", F.date_sub("valid_to", 1)))
 .withColumn("timestamp", F.explode("all_days"))
 # sum the consumption of all intervals that cover a given day
 .groupBy("idx", "timestamp").sum("consumption")
 .withColumnRenamed("sum(consumption)", "consumption")
 # restore the interval end dates that were dropped by the exclusive upper bound
 .join(input.select("idx", "valid_to").distinct().withColumnRenamed("idx", "idx2"),
       (F.col("timestamp") == F.col("valid_to")) & (F.col("idx") == F.col("idx2")),
       "full_outer")
 .withColumn("idx", F.coalesce("idx", "idx2"))
 .withColumn("timestamp", F.coalesce("timestamp", "valid_to"))
 .drop("idx2", "valid_to")
 .fillna(0.0)
 .orderBy("idx", "timestamp")
 .show())
Output:
+---+-------------------+-----------+
|idx|          timestamp|consumption|
+---+-------------------+-----------+
|  0|2019-01-01 00:00:00|        5.0|
|  0|2019-01-02 00:00:00|        0.0|
|  1|2019-01-01 00:00:00|       10.0|
|  1|2019-01-02 00:00:00|       20.0|
|  1|2019-01-03 00:00:00|       50.0|
|  1|2019-01-04 00:00:00|       50.0|
|  1|2019-01-05 00:00:00|        0.0|
|  1|2019-01-06 00:00:00|       40.0|
|  1|2019-01-07 00:00:00|       40.0|
|  1|2019-01-08 00:00:00|        0.0|
+---+-------------------+-----------+
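To see what the expansion step produces on its own, here is a minimal sketch that reuses the input frame from above; the name "expanded" and the select of three columns are only for illustration and not part of the original answer:

from pyspark.sql import functions as F

# Sketch: the sequence + explode step in isolation, before any aggregation.
# Each interval becomes one row per covered day.
expanded = (
    input
    .withColumn("all_days", F.sequence("valid_from", F.date_sub("valid_to", 1)))
    .withColumn("timestamp", F.explode("all_days"))
    .select("idx", "timestamp", "consumption")
)
expanded.show()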
Remarks:
- sequence includes the last value of the interval, so one day has to be subtracted from valid_to.
- The missing end dates of the intervals are then restored using a full join with the original valid_to values, filling up the resulting null values with 0.0.
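A quick way to confirm the inclusive upper bound of sequence is the following toy sketch (not part of the original answer; the one-row frame from spark.range(1) is just a convenient place to evaluate the expression):

from pyspark.sql import functions as F

# sequence(start, stop) over dates steps by one day and includes both
# endpoints, which is why date_sub is applied to valid_to above.
spark.range(1).select(
    F.sequence(F.lit("2019-01-01").cast("date"),
               F.lit("2019-01-03").cast("date")).alias("days")
).show(truncate=False)
# expected: [2019-01-01, 2019-01-02, 2019-01-03]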