I would like to groupby the data in interval of a hourly/daily/weekly and further group by certain other clauses. I was able to acheive groupby hourly/daily/weekly basis by using groupby_dynamic option provided by polars.
How do we add a secondary non datetime groupby clause to the polars dataframe after using groupby_dynamic operation in polar?
The sample dataframe read from csv is
┌─────────-─────────-─┬─────────────┬────────┬─────────┬──-───────────┐ | Date ┆ Item ┆ Issue ┆ Channel ┆ ID │ |═════════════════════|═════════════|════════|═════════|══════════════| | 2023-01-02 01:00:00 ┆ Item ABC ┆ EAAGCD ┆ Twitter ┆ 32513995 │ | 2023-01-02 01:40:00 ┆ Item ABC ┆ ASDFFF ┆ Web ┆ 32513995 │ | 2023-01-02 02:15:00 ┆ Item ABC ┆ WERWET ┆ Web ┆ 32513995 │ | 2023-01-02 03:00:00 ┆ Item ABC ┆ BVRTNB ┆ Twitter ┆ 32513995 │ | 2023-01-03 04:11:00 ┆ Item ABC ┆ VDFGVS ┆ Fax ┆ 32513995 │ | 2023-01-03 04:30:00 ┆ Item ABC ┆ QWEDWE ┆ Twitter ┆ 32513995 │ | 2023-01-03 04:45:00 ┆ Item ABC ┆ BRHMNU ┆ Fax ┆ 32513995 │ └─────────────────────┴─────────────┴────────┴─────────┴──────────────┘
I am grouping this data in houlry interval using polars groupby_dynamic operation using the below code snippet.
import polars as pl q = ( pl.scan_csv("Test.csv", parse_dates=True) .filter(pl.col("Item") == "Item ABC") .groupby_dynamic("Date", every="1h", closed="right") .agg([pl.col("ID").count().alias("total")]) .sort(["Date"]) ) df = q.collect()
This code gives me result as
┌─────────────────────┬───────┐ │ Date ┆ total │ ╞═════════════════════╪═══════╡ │ 2023-01-02 01:00:00 ┆ 2 │ │ 2023-01-02 02:00:00 ┆ 1 │ │ 2023-01-02 03:00:00 ┆ 1 │ │ 2023-01-05 04:00:00 ┆ 3 │ └─────────────────────┴───────┘
But i would want to further group by this data by “Channel” and expecting the result as
┌────────────-──────-─┬─────────┬───────┐ │ Date ┆ Channel ┆ total │ ╞═════════════════════╪═════════╪═══════╡ │ 2023-01-02 01:00:00 ┆ Twitter ┆ 1 │ │ 2023-01-02 01:00:00 ┆ Web ┆ 1 │ │ 2023-01-02 01:00:00 ┆ Web ┆ 1 │ │ 2023-01-02 01:00:00 ┆ Twitter ┆ 1 │ │ 2023-01-03 01:00:00 ┆ Fax ┆ 2 │ │ 2023-01-11 01:00:00 ┆ Twitter ┆ 1 │ └─────────────────────┴─────────┴───────┘
Advertisement
Answer
You can specify by
q = ( pl.scan_csv("Test.csv", parse_dates=True) .filter(pl.col("Item") == "Item ABC") .groupby_dynamic("Date", every="1h", closed="right", by="Item") .agg([pl.col("ID").count().alias("total")]) .sort(["Date"]) )