How to roll up duplicate observations in Python Polars?

Question

I have a data frame. I would like to find duplicates based on the last_name and first_name columns. If any duplicates are found, their respective ssn values should be rolled up into one value, joined with a semicolon (;), when the SSNs differ. If the SSNs are also the same, only one SSN should be present. The expected output is: Here si…

Accepted Answer

Use a group_by and unique to remove duplicates. From there, you can use arr.join on the resulting list.

    (
        my_dt
        .groupby(['last_name', 'first_name'])
        .agg([
            pl.col('ssn').unique()
        ])
        .with_column(
            pl.col('ssn').arr.join(';')
        )
    )

    shape: (3, 3)
    ┌───────────┬────────────┬───────────┐
    │ last_name ┆ first_name ┆ ssn       │
    │ ---       ┆ ---        ┆ ---       │
    │ str       ┆ str        ┆ str       │
    ╞═══════════╪════════════╪═══════════╡
    │ mallesh   ┆ yamulla    ┆ 4567;1234 │
    ├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
    │ bhavik    ┆ vemulla    ┆ 7847      │
    ├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
    │ jagarini  ┆ yegurla    ┆ 0648      │
    └───────────┴────────────┴───────────┘

Edit: if you want to ensure that the rolled up list is sorted:

    (
        my_dt
        .groupby(['last_name', 'first_name'])
        .agg([
            pl.col('ssn')
            .unique()
            .sort()
        ])
        .with_column(
            pl.col('ssn')
            .arr.join(';')
        )
    )

    shape: (3, 3)
    ┌───────────┬────────────┬───────────┐
    │ last_name ┆ first_name ┆ ssn       │
    │ ---       ┆ ---        ┆ ---       │
    │ str       ┆ str        ┆ str       │
    ╞═══════════╪════════════╪═══════════╡
    │ jagarini  ┆ yegurla    ┆ 0648      │
    ├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
    │ mallesh   ┆ yamulla    ┆ 1234;4567 │
    ├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
    │ bhavik    ┆ vemulla    ┆ 7847      │
    └───────────┴────────────┴───────────┘

Edit: Rolling up multiple columns

We can roll up multiple columns elegantly as follows:

    (
        my_dt
        .groupby(["last_name", "first_name"])
        .agg([
            pl.all().unique().sort().cast(pl.Utf8)
        ])
        .with_columns([
            pl.exclude(['last_name', 'first_name']).arr.join(";")
        ])
    )

    shape: (3, 4)
    ┌───────────┬────────────┬───────────┬───────────────────────┐
    │ last_name ┆ first_name ┆ ssn       ┆ dob                   │
    │ ---       ┆ ---        ┆ ---       ┆ ---                   │
    │ str       ┆ str        ┆ str       ┆ str                   │
    ╞═══════════╪════════════╪═══════════╪═══════════════════════╡
    │ bhavik    ┆ vemulla    ┆ 7847      ┆ 1991-09-16            │
    ├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    │ jagarini  ┆ yegurla    ┆ 0648      ┆ 1983-02-14;1990-01-01 │
    ├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    │ mallesh   ┆ yamulla    ┆ 1234;4567 ┆ 1990-10-11            │
    └───────────┴────────────┴───────────┴───────────────────────┘

Edit: eliminating empty strings and `null` values from rollup

We can add a filter step just before the arr.join to filter out both `null` and empty string "" values.

    (
        my_dt.groupby(["last_name", "first_name"])
        .agg([pl.all().unique().sort().cast(pl.Utf8)])
        .with_columns(
            [
                pl.exclude(["last_name", "first_name"])
                .arr.eval(
                    pl.element().filter(pl.element().is_not_null() & (pl.element() != ""))
                )
                .arr.join(";")
            ]
        )
    )
