In my DataFrame, I have many instances of same AutoNumber
having different KeyValue_String
. I would like to convert these instances to a single row where the KeyValue_String
is a list comprised of the multiple unique values.
JavaScript
x
17
17
1
AutoNumber KeyValue_String ReferralType Description
2
0 50899 DD 3 Web Search
3
1 50905 Cheque 1 Gatestone Collections
4
2 50906 DD 2 Centum Mortgage Brokers
5
3 50907 Cheque 1 Financial Debt Recovery Ltd.
6
4 50908 DD 2 Centum Mortgage Brokers
7
5 50909 DD 2 Centum Mortgage Brokers
8
6 50910 Cheque 1 Allied International Credit
9
7 50911 Cheque 1 D&A Collection Corp
10
8 50912 Cheque 1 Gatestone Collections
11
9 50913 Cheque 1 Financial Debt Recovery Ltd.
12
10 50914 Cheque 3 Existing Customer - Refinancing
13
11 50914 DD 3 Existing Customer - Refinancing
14
12 50915 Cheque 1 Gatestone Collections
15
13 50916 Cheque 3 Existing Customer - Refinancing
16
14 50916 Cheque 3 Existing Customer - Refinancing
17
The desired output would look like this, except I want to keep all of the other columns
JavaScript
1
17
17
1
AutoNumber KeyValue_String
2
0 50899 DD
3
1 50905 Cheque
4
2 50906 DD
5
3 50907 Cheque
6
4 50908 DD
7
5 50909 DD
8
6 50910 Cheque
9
7 50911 Cheque
10
8 50912 Cheque
11
9 50913 Cheque
12
10 50914 [Cheque, DD]
13
11 50915 Cheque
14
12 50916 Cheque
15
13 50917 Cheque
16
14 50918 Cheque
17
Advertisement
Answer
If I understand correctly, you could opt for using groupby
, transform
, and unique
.
JavaScript
1
2
1
df['KeyValue_String'] = df.groupby('AutoNumber').KeyValue_String.transform('unique')
2
Then you can drop duplicates assuming as mentioned in the comments that rows with the same AutoNumber contain duplicate information besides the KeyValue_String.
JavaScript
1
2
1
df = df.drop_duplicates(subset='AutoNumber')
2
I would advise if you want arrays you keep everything in the column as an array, and don’t expend effort putting mixed types in the column which will just be harder to work with anyways.
Demo
JavaScript
1
37
37
1
>>> df
2
AutoNumber KeyValue_String
3
0 50899 DD
4
1 50905 Cheque
5
2 50906 DD
6
3 50907 Cheque
7
4 50908 DD
8
5 50909 DD
9
6 50910 Cheque
10
7 50911 Cheque
11
8 50912 Cheque
12
9 50913 Cheque
13
10 50914 Cheque
14
11 50914 DD
15
12 50915 Cheque
16
13 50916 Cheque
17
14 50916 Cheque
18
19
>>> df['KeyValue_String'] = df.groupby('AutoNumber').KeyValue_String.transform('unique')
20
21
>>> df.drop_duplicates(subset='AutoNumber')
22
23
AutoNumber KeyValue_String
24
0 50899 [DD]
25
1 50905 [Cheque]
26
2 50906 [DD]
27
3 50907 [Cheque]
28
4 50908 [DD]
29
5 50909 [DD]
30
6 50910 [Cheque]
31
7 50911 [Cheque]
32
8 50912 [Cheque]
33
9 50913 [Cheque]
34
10 50914 [Cheque, DD]
35
12 50915 [Cheque]
36
13 50916 [Cheque]
37