I have a data frame as-
my_dt = pl.DataFrame({'last_name':['mallesh','bhavik','jagarini','mallesh','jagarini'],
'first_name':['yamulla','vemulla','yegurla','yamulla','yegurla'],
'ssn':['1234','7847','0648','4567','0648']})
Here I would like to find out duplicates considering last_name and firs_name columns and if any duplicates found their respective ssn needs to be rolled up with semicolon(;) if SSN are not different. if SSN are also same only one SSN needs to be present.
the expected output as:
Here since mallesh yamulla is duplicated and has different SSN’s they are rolled up with ‘;’
and in case of jagarini yegurla it has a unique SSN hence one SSN is only taken.
Added one more case:
Here on given any set of column it should rollup the unique values using ; from the remaining columns. here on last and first name, roll up should be done on both DOB and SSN.
my_dt = pl.DataFrame({'last_name':['mallesh','bhavik','jagarini','mallesh','jagarini'],
'first_name':['yamulla','vemulla','yegurla','yamulla','yegurla'],
'ssn':['1234','7847','0648','4567','0648'],
'dob':['10/11/1990','09/16/1991','01/01/1990','10/11/1990','02/14/1983'] })
Another case as:
my_dt = pl.DataFrame({'last_name':['mallesh','bhavik','jagarini','mallesh','jagarini'],
'first_name':['yamulla','vemulla','yegurla','yamulla','yegurla'],
'ssn':['1234','7847','0648','4567','0648'],
'dob':['10/11/1990','09/16/1991','01/01/1990','','02/14/1983'] })
In case of having null values in a field it should treat as empty not as a value.
“;10/11/1990” it should just be “10/11/1990” for mallesh yamulla entry.
Advertisement
Answer
Use a group_by
and unique
to remove duplicates. From there, you can use arr.join
on the resulting list.
(
my_dt
.groupby(['last_name', 'first_name'])
.agg([
pl.col('ssn').unique()
])
.with_column(
pl.col('ssn').arr.join(';')
)
)
shape: (3, 3)
┌───────────┬────────────┬───────────┐
│ last_name ┆ first_name ┆ ssn │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═══════════╪════════════╪═══════════╡
│ mallesh ┆ yamulla ┆ 4567;1234 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ bhavik ┆ vemulla ┆ 7847 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ jagarini ┆ yegurla ┆ 0648 │
└───────────┴────────────┴───────────┘
Edit: if you want to ensure that the rolled up list is sorted:
(
my_dt
.groupby(['last_name', 'first_name'])
.agg([
pl.col('ssn')
.unique()
.sort()
])
.with_column(
pl.col('ssn')
.arr.join(';')
)
)
shape: (3, 3)
┌───────────┬────────────┬───────────┐
│ last_name ┆ first_name ┆ ssn │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═══════════╪════════════╪═══════════╡
│ jagarini ┆ yegurla ┆ 0648 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ mallesh ┆ yamulla ┆ 1234;4567 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ bhavik ┆ vemulla ┆ 7847 │
└───────────┴────────────┴───────────┘
Edit: Rolling up multiple columns
We can roll up multiple columns elegantly as follows:
(
my_dt
.groupby(["last_name", "first_name"])
.agg([
pl.all().unique().sort().cast(pl.Utf8)
])
.with_columns([
pl.exclude(['last_name', 'first_name']).arr.join(";")
])
)
shape: (3, 4)
┌───────────┬────────────┬───────────┬───────────────────────┐
│ last_name ┆ first_name ┆ ssn ┆ dob │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str │
╞═══════════╪════════════╪═══════════╪═══════════════════════╡
│ bhavik ┆ vemulla ┆ 7847 ┆ 1991-09-16 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ jagarini ┆ yegurla ┆ 0648 ┆ 1983-02-14;1990-01-01 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mallesh ┆ yamulla ┆ 1234;4567 ┆ 1990-10-11 │
└───────────┴────────────┴───────────┴───────────────────────┘
Edit: eliminating empty strings and null
values from rollup
We can add a filter step just before the arr.join
to filter out both null
and empty string ""
values.
(
my_dt.groupby(["last_name", "first_name"])
.agg([pl.all().unique().sort().cast(pl.Utf8)])
.with_columns(
[
pl.exclude(["last_name", "first_name"])
.arr.eval(
pl.element().filter(pl.element().is_not_null() & (pl.element() != ""))
)
.arr.join(";")
]
)
)