
Adding a secondary group by clause using the groupby_dynamic() operation in Polars

I would like to group the data into hourly/daily/weekly intervals and then further group it by certain other clauses. I was able to achieve the hourly/daily/weekly grouping by using the groupby_dynamic option provided by Polars.

How do we add a secondary, non-datetime group by clause to the Polars dataframe when using the groupby_dynamic operation?

The sample dataframe read from the CSV is:

┌─────────────────────┬─────────────┬────────┬─────────┬──────────────┐
│ Date                ┆ Item        ┆ Issue  ┆ Channel ┆ ID           │
╞═════════════════════╪═════════════╪════════╪═════════╪══════════════╡
│ 2023-01-02 01:00:00 ┆ Item ABC    ┆ EAAGCD ┆ Twitter ┆ 32513995     │
│ 2023-01-02 01:40:00 ┆ Item ABC    ┆ ASDFFF ┆ Web     ┆ 32513995     │
│ 2023-01-02 02:15:00 ┆ Item ABC    ┆ WERWET ┆ Web     ┆ 32513995     │
│ 2023-01-02 03:00:00 ┆ Item ABC    ┆ BVRTNB ┆ Twitter ┆ 32513995     │
│ 2023-01-03 04:11:00 ┆ Item ABC    ┆ VDFGVS ┆ Fax     ┆ 32513995     │
│ 2023-01-03 04:30:00 ┆ Item ABC    ┆ QWEDWE ┆ Twitter ┆ 32513995     │
│ 2023-01-03 04:45:00 ┆ Item ABC    ┆ BRHMNU ┆ Fax     ┆ 32513995     │
└─────────────────────┴─────────────┴────────┴─────────┴──────────────┘

I am grouping this data into hourly intervals with the Polars groupby_dynamic operation, using the code snippet below.

import polars as pl


q = (
    pl.scan_csv("Test.csv", parse_dates=True)
    .filter(pl.col("Item") == "Item ABC")
    .groupby_dynamic("Date", every="1h", closed="right")
    .agg([pl.col("ID").count().alias("total")])
    .sort(["Date"])
)

df = q.collect()

This code gives me the following result:

┌─────────────────────┬───────┐
│ Date                ┆ total │
╞═════════════════════╪═══════╡
│ 2023-01-02 01:00:00 ┆ 2     │
│ 2023-01-02 02:00:00 ┆ 1     │
│ 2023-01-02 03:00:00 ┆ 1     │
│ 2023-01-05 04:00:00 ┆ 3     │
└─────────────────────┴───────┘

But I would like to further group this data by “Channel”, expecting the result:

┌─────────────────────┬─────────┬───────┐
│ Date                ┆ Channel ┆ total │
╞═════════════════════╪═════════╪═══════╡
│ 2023-01-02 01:00:00 ┆ Twitter ┆ 1     │
│ 2023-01-02 01:00:00 ┆ Web     ┆ 1     │
│ 2023-01-02 01:00:00 ┆ Web     ┆ 1     │
│ 2023-01-02 01:00:00 ┆ Twitter ┆ 1     │
│ 2023-01-03 01:00:00 ┆ Fax     ┆ 2     │
│ 2023-01-11 01:00:00 ┆ Twitter ┆ 1     │
└─────────────────────┴─────────┴───────┘


Answer

You can specify the additional grouping key with the by parameter of groupby_dynamic:

q = (
    pl.scan_csv("Test.csv", parse_dates=True)
    .filter(pl.col("Item") == "Item ABC")
    # by= adds secondary, non-temporal group keys on top of the dynamic windows
    .groupby_dynamic("Date", every="1h", closed="right", by="Channel")
    .agg([pl.col("ID").count().alias("total")])
    .sort(["Date"])
)
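
If your version of Polars does not accept the by argument on groupby_dynamic, or you simply prefer an ordinary group by, a minimal alternative sketch (assuming the same Test.csv layout as above) is to truncate each timestamp to the start of its hour and then group on that bucket together with Channel. Note that dt.truncate produces left-closed buckets, so the window labels can differ slightly from the closed="right" windows above.

import polars as pl

q = (
    pl.scan_csv("Test.csv", parse_dates=True)
    .filter(pl.col("Item") == "Item ABC")
    # floor each timestamp to the start of its hour to form the bucket
    .with_columns(pl.col("Date").dt.truncate("1h"))
    # ordinary groupby over the hour bucket plus the secondary key
    .groupby(["Date", "Channel"])
    .agg(pl.col("ID").count().alias("total"))
    .sort(["Date", "Channel"])
)

df = q.collect()

Because the truncation is just another lazy expression, the rest of the pipeline (agg, sort, collect) stays the same.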