How to roll up duplicate observations in Python Polars?

I have a data frame with last_name, first_name and ssn columns.

Here I would like to find duplicates considering the last_name and first_name columns. If any duplicates are found, their respective SSNs need to be rolled up with a semicolon (;) if the SSNs are different. If the SSNs are the same, only one SSN needs to be present.

In the expected output, since mallesh yamulla is duplicated and has different SSNs, they are rolled up with ‘;’.

In the case of jagarini yegurla, the SSN is unique, so only one SSN is kept.

Added one more case:

Given any set of key columns, the unique values from the remaining columns should be rolled up using ‘;’. Here, grouping on last and first name, the roll-up should be done on both DOB and SSN.


Another case:

If a field contains null values, they should be treated as empty, not as a value.

Instead of “;10/11/1990”, it should just be “10/11/1990” for the mallesh yamulla entry.


Answer

Use group_by and unique to remove duplicates. From there, you can join the resulting list into a single string with list.join (called arr.join in older Polars releases).


Edit: if you want to ensure that the rolled up list is sorted:


Edit: Rolling up multiple columns

We can roll up multiple columns elegantly as follows:


Edit: eliminating empty strings and null values from rollup

We can add a filter step just before the join (list.join, or arr.join in older Polars releases) to filter out both null and empty string "" values.

User contributions licensed under: CC BY-SA