In my DataFrame, I have many instances of same AutoNumber
having different KeyValue_String
. I would like to convert these instances to a single row where the KeyValue_String
is a list comprised of the multiple unique values.
AutoNumber KeyValue_String ReferralType Description 0 50899 DD 3 Web Search 1 50905 Cheque 1 Gatestone Collections 2 50906 DD 2 Centum Mortgage Brokers 3 50907 Cheque 1 Financial Debt Recovery Ltd. 4 50908 DD 2 Centum Mortgage Brokers 5 50909 DD 2 Centum Mortgage Brokers 6 50910 Cheque 1 Allied International Credit 7 50911 Cheque 1 D&A Collection Corp 8 50912 Cheque 1 Gatestone Collections 9 50913 Cheque 1 Financial Debt Recovery Ltd. 10 50914 Cheque 3 Existing Customer - Refinancing 11 50914 DD 3 Existing Customer - Refinancing 12 50915 Cheque 1 Gatestone Collections 13 50916 Cheque 3 Existing Customer - Refinancing 14 50916 Cheque 3 Existing Customer - Refinancing
The desired output would look like this, except I want to keep all of the other columns
AutoNumber KeyValue_String 0 50899 DD 1 50905 Cheque 2 50906 DD 3 50907 Cheque 4 50908 DD 5 50909 DD 6 50910 Cheque 7 50911 Cheque 8 50912 Cheque 9 50913 Cheque 10 50914 [Cheque, DD] 11 50915 Cheque 12 50916 Cheque 13 50917 Cheque 14 50918 Cheque
Advertisement
Answer
If I understand correctly, you could opt for using groupby
, transform
, and unique
.
df['KeyValue_String'] = df.groupby('AutoNumber').KeyValue_String.transform('unique')
Then you can drop duplicates assuming as mentioned in the comments that rows with the same AutoNumber contain duplicate information besides the KeyValue_String.
df = df.drop_duplicates(subset='AutoNumber')
I would advise if you want arrays you keep everything in the column as an array, and don’t expend effort putting mixed types in the column which will just be harder to work with anyways.
Demo
>>> df AutoNumber KeyValue_String 0 50899 DD 1 50905 Cheque 2 50906 DD 3 50907 Cheque 4 50908 DD 5 50909 DD 6 50910 Cheque 7 50911 Cheque 8 50912 Cheque 9 50913 Cheque 10 50914 Cheque 11 50914 DD 12 50915 Cheque 13 50916 Cheque 14 50916 Cheque >>> df['KeyValue_String'] = df.groupby('AutoNumber').KeyValue_String.transform('unique') >>> df.drop_duplicates(subset='AutoNumber') AutoNumber KeyValue_String 0 50899 [DD] 1 50905 [Cheque] 2 50906 [DD] 3 50907 [Cheque] 4 50908 [DD] 5 50909 [DD] 6 50910 [Cheque] 7 50911 [Cheque] 8 50912 [Cheque] 9 50913 [Cheque] 10 50914 [Cheque, DD] 12 50915 [Cheque] 13 50916 [Cheque]