Skip to content
Advertisement

Python/Pandas: If Column has multiple values, convert to single row with multiples values in list

In my DataFrame, I have many instances of same AutoNumber having different KeyValue_String. I would like to convert these instances to a single row where the KeyValue_String is a list comprised of the multiple unique values.

    AutoNumber KeyValue_String  ReferralType                      Description
0        50899              DD             3                       Web Search
1        50905          Cheque             1            Gatestone Collections
2        50906              DD             2          Centum Mortgage Brokers
3        50907          Cheque             1     Financial Debt Recovery Ltd.
4        50908              DD             2          Centum Mortgage Brokers
5        50909              DD             2          Centum Mortgage Brokers
6        50910          Cheque             1      Allied International Credit
7        50911          Cheque             1              D&A Collection Corp
8        50912          Cheque             1            Gatestone Collections
9        50913          Cheque             1     Financial Debt Recovery Ltd.
10       50914          Cheque             3  Existing Customer - Refinancing
11       50914              DD             3  Existing Customer - Refinancing
12       50915          Cheque             1            Gatestone Collections
13       50916          Cheque             3  Existing Customer - Refinancing
14       50916          Cheque             3  Existing Customer - Refinancing

The desired output would look like this, except I want to keep all of the other columns

      AutoNumber KeyValue_String
0          50899            DD
1          50905        Cheque
2          50906            DD
3          50907        Cheque
4          50908            DD
5          50909            DD
6          50910        Cheque
7          50911        Cheque
8          50912        Cheque
9          50913        Cheque
10         50914    [Cheque, DD]
11         50915        Cheque
12         50916        Cheque
13         50917        Cheque
14         50918        Cheque

Advertisement

Answer

If I understand correctly, you could opt for using groupby, transform, and unique.

df['KeyValue_String'] = df.groupby('AutoNumber').KeyValue_String.transform('unique')

Then you can drop duplicates assuming as mentioned in the comments that rows with the same AutoNumber contain duplicate information besides the KeyValue_String.

df = df.drop_duplicates(subset='AutoNumber')

I would advise if you want arrays you keep everything in the column as an array, and don’t expend effort putting mixed types in the column which will just be harder to work with anyways.

Demo

>>> df
    AutoNumber KeyValue_String
0        50899              DD
1        50905          Cheque
2        50906              DD
3        50907          Cheque
4        50908              DD
5        50909              DD
6        50910          Cheque
7        50911          Cheque
8        50912          Cheque
9        50913          Cheque
10       50914          Cheque
11       50914              DD
12       50915          Cheque
13       50916          Cheque
14       50916          Cheque

>>> df['KeyValue_String'] = df.groupby('AutoNumber').KeyValue_String.transform('unique')

>>> df.drop_duplicates(subset='AutoNumber')

    AutoNumber KeyValue_String
0        50899            [DD]
1        50905        [Cheque]
2        50906            [DD]
3        50907        [Cheque]
4        50908            [DD]
5        50909            [DD]
6        50910        [Cheque]
7        50911        [Cheque]
8        50912        [Cheque]
9        50913        [Cheque]
10       50914    [Cheque, DD]
12       50915        [Cheque]
13       50916        [Cheque]
User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement