How to perform split/merge/melt with Python and polars?

Question

I have a data transformation problem where the original data consists of "blocks" of three rows of data, where the first row denotes a 'parent' and the two others are related children. A minimum working example looks like this: In reality, there are up to 15 Providers (so up to 30 colu…

Accepted Answer

Here's how I've attempted it: fill the nulls in the Parent Order ID column and use that to .groupby()

>>> import polars as pl
>>> columns = ["Order ID", "Direction", "Price", "Some Value"]
... names   = pl.col("^Name .*$")   # All name columns
... quotes  = pl.col("^Quote .*$")  # All quote columns
... (
...    df_original_two_orders
...    .with_column(pl.col("Parent Order ID").backward_fill())
...    .groupby("Parent Order ID")
...    .agg([
...       pl.col(columns).first(),
...       pl.concat_list(names.first()).alias("Name"),  # Put all names into single column:  ["Name1", "Name2", ...]
...       pl.col("^Quote .*$").slice(1),                # Create list for each quote column (skip first row): [1.1, 1.3], [1.15, 1.25], ...
...    ])
...    .with_columns([
...       pl.concat_list(                               # Create list of Buy values
...          pl.when(pl.col("Direction") == "Buy")
...            .then(quotes.arr.first())
...            .otherwise(quotes.arr.last())
...          .alias("Buy")),
...       pl.concat_list(                               # Create list of Sell values
...          pl.when(pl.col("Direction") == "Sell")
...            .then(quotes.arr.first())
...            .otherwise(quotes.arr.last())
...          .alias("Sell")
...       )
...    ])
...    .select(columns + ["Name", "Buy", "Sell"])       # Remove Name/Quote [1234..] columns
...    .explode(["Name", "Buy", "Sell"])                # Turn into rows
... )
shape: (8, 7)
┌──────────┬───────────┬─────────┬────────────┬──────┬──────┬──────┐
│ Order ID │ Direction │ Price   │ Some Value │ Name │ Buy  │ Sell │
│ ---      │ ---       │ ---     │ ---        │ ---  │ ---  │ ---  │
│ str      │ str       │ f64     │ i64        │ str  │ f64  │ f64  │
╞══════════╪═══════════╪═════════╪════════════╪══════╪══════╪══════╡
│ B        │ Sell      │ 1.1384  │ 42         │ P2   │ 1.4  │ 1.1  │
│ B        │ Sell      │ 1.1384  │ 42         │ P1   │ 1.39 │ 1.11 │
│ B        │ Sell      │ 1.1384  │ 42         │ P3   │ 1.55 │ 1.05 │
│ B        │ Sell      │ 1.1384  │ 42         │ null │ null │ null │
│ A        │ Buy       │ 1.21003 │ 4          │ P8   │ 1.1  │ 1.3  │
│ A        │ Buy       │ 1.21003 │ 4          │ P2   │ 1.15 │ 1.25 │
│ A        │ Buy       │ 1.21003 │ 4          │ P1   │ 1.0  │ 1.4  │
│ A        │ Buy       │ 1.21003 │ 4          │ P5   │ 1.0  │ 1.4  │
└──────────┴───────────┴─────────┴────────────┴──────┴──────┴──────┘

Explanation:

Step 1 creates a list of names and puts each quote into a list:

>>> columns = ["Order ID", "Direction", "Price", "Some Value"]
... names   = pl.col("^Name .*$")   # All name columns
... quotes  = pl.col("^Quote .*$")  # All quote columns
... agg = (
...    df_original_two_orders
...    .with_column(pl.col("Parent Order ID").backward_fill())
...    .groupby("Parent Order ID")
...    .agg([
...       pl.col(columns).first(),
...       pl.concat_list(names.first()).alias("Name"),  # Put all names into single column:  ["Name1", "Name2", ...]
...       pl.col("^Quote .*$").slice(1),                # Create list for each quote column (skip first row): [1.1, 1.3], [1.15, 1.25], ...
...    ])
... )
>>> agg
shape: (2, 10)
┌─────────────────┬──────────┬───────────┬─────────┬────────────┬────────────────────────┬──────────────────┬──────────────────┬──────────────────┬──────────────────┐
│ Parent Order ID │ Order ID │ Direction │ Price   │ Some Value │ Name                   │ Quote Provider 1 │ Quote Provider 2 │ Quote Provider 3 │ Quote Provider 4 │
│ ---             │ ---      │ ---       │ ---     │ ---        │ ---                    │ ---              │ ---              │ ---              │ ---              │
│ str             │ str      │ str       │ f64     │ i64        │ list[str]              │ list[f64]        │ list[f64]        │ list[f64]        │ list[f64]        │
╞═════════════════╪══════════╪═══════════╪═════════╪════════════╪════════════════════════╪══════════════════╪══════════════════╪══════════════════╪══════════════════╡
│ A               │ A        │ Buy       │ 1.21003 │ 4          │ ["P8", "P2", ... "P5"] │ [1.1, 1.3]       │ [1.15, 1.25]     │ [1.0, 1.4]       │ [1.0, 1.4]       │
│ B               │ B        │ Sell      │ 1.1384  │ 42         │ ["P2", "P1", ... null] │ [1.1, 1.4]       │ [1.11, 1.39]     │ [1.05, 1.55]     │ [null, null]     │
└─────────────────┴──────────┴───────────┴─────────┴────────────┴────────────────────────┴──────────────────┴──────────────────┴──────────────────┴──────────────────┘

Step 2 creates separate Buy/Sell lists from the Quote columns.

We can use pl.when().then().otherwise() to decide whether to take the first or the last value in each Quote list, depending on whether the Direction is Buy or Sell.

>>> (
...    agg
...    .with_columns([
...       pl.concat_list(                               # Create list of Buy values
...          pl.when(pl.col("Direction") == "Buy")
...            .then(quotes.arr.first())
...            .otherwise(quotes.arr.last())
...          .alias("Buy")),
...       pl.concat_list(                               # Create list of Sell values
...          pl.when(pl.col("Direction") == "Sell")
...            .then(quotes.arr.first())
...            .otherwise(quotes.arr.last())
...          .alias("Sell")
...       )
...    ])
...    .select(columns + ["Name", "Buy", "Sell"])
... )
shape: (2, 7)
┌──────────┬───────────┬─────────┬────────────┬────────────────────────┬───────────────────────┬───────────────────────┐
│ Order ID │ Direction │ Price   │ Some Value │ Name                   │ Buy                   │ Sell                  │
│ ---      │ ---       │ ---     │ ---        │ ---                    │ ---                   │ ---                   │
│ str      │ str       │ f64     │ i64        │ list[str]              │ list[f64]             │ list[f64]             │
╞══════════╪═══════════╪═════════╪════════════╪════════════════════════╪═══════════════════════╪═══════════════════════╡
│ A        │ Buy       │ 1.21003 │ 4          │ ["P8", "P2", ... "P5"] │ [1.1, 1.15, ... 1.0]  │ [1.3, 1.25, ... 1.4]  │
│ B        │ Sell      │ 1.1384  │ 42         │ ["P2", "P1", ... null] │ [1.4, 1.39, ... null] │ [1.1, 1.11, ... null] │
└──────────┴───────────┴─────────┴────────────┴────────────────────────┴───────────────────────┴───────────────────────┘

Finally we .explode() to turn the lists into rows.

You can add a .drop_nulls() afterwards to remove the null rows if desired.
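Several of the polars names used above have since been renamed in newer releases (with_column to with_columns, groupby to group_by, the .arr namespace to .list), so it may help to see the same steps without the library. The following is only a plain-Python sketch of the logic, not the polars implementation. The input rows are hypothetical, rebuilt from the output tables above: the original frame is not shown, so the child rows' other fields are left as None, and each position in the "names" and "quotes" lists stands in for one Name/Quote column. The helpers row, backward_fill and reshape are names made up for this illustration.

```python
from itertools import groupby

def row(parent_id, order_id=None, direction=None, price=None, value=None,
        names=(None,) * 4, quotes=(None,) * 4):
    """Build one hypothetical input row (list positions stand in for columns)."""
    return {"Parent Order ID": parent_id, "Order ID": order_id,
            "Direction": direction, "Price": price, "Some Value": value,
            "names": list(names), "quotes": list(quotes)}

# Assumed layout: a parent row (no Parent Order ID) followed by its two children.
rows = [
    row(None, "A", "Buy", 1.21003, 4, names=["P8", "P2", "P1", "P5"]),
    row("A", quotes=[1.1, 1.15, 1.0, 1.0]),
    row("A", quotes=[1.3, 1.25, 1.4, 1.4]),
    row(None, "B", "Sell", 1.1384, 42, names=["P2", "P1", "P3", None]),
    row("B", quotes=[1.1, 1.11, 1.05, None]),
    row("B", quotes=[1.4, 1.39, 1.55, None]),
]

def backward_fill(values):
    """Replace each None with the next non-None value that follows it."""
    filled, nxt = [], None
    for v in reversed(values):
        if v is not None:
            nxt = v
        filled.append(nxt)
    return filled[::-1]

def reshape(rows, drop_nulls=False):
    # Backward-fill the key so the parent row shares it with its children.
    keys = backward_fill([r["Parent Order ID"] for r in rows])
    out = []
    # itertools.groupby groups consecutive equal keys; blocks are contiguous here.
    for _, block in groupby(zip(keys, rows), key=lambda kr: kr[0]):
        parent, *children = (r for _, r in block)
        # Like .slice(1): skip the parent row, keep the children's quotes.
        first, last = children[0]["quotes"], children[-1]["quotes"]
        # Same rule as the when/then/otherwise expressions above.
        buy = first if parent["Direction"] == "Buy" else last
        sell = first if parent["Direction"] == "Sell" else last
        # Like .explode(): one output record per provider position.
        for name, b, s in zip(parent["names"], buy, sell):
            record = {"Order ID": parent["Order ID"],
                      "Direction": parent["Direction"],
                      "Price": parent["Price"],
                      "Some Value": parent["Some Value"],
                      "Name": name, "Buy": b, "Sell": s}
            if drop_nulls and any(v is None for v in record.values()):
                continue
            out.append(record)
    return out

for record in reshape(rows):
    print(record)
```

Calling reshape(rows) gives eight records, and reshape(rows, drop_nulls=True) drops the one all-null provider row for order B. This sketch keeps the input order of the blocks, so the A rows come first, whereas the polars output above happens to list B first.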
