I have a sample DF:
JavaScript
x
4
1
df = pd.DataFrame(np.random.randint(0,10,size = (6,2)),columns=["A","B"])
2
df["col"] = ["Apple","Apple","Mango","Apple","Mango","Apple"]
3
df
4
OP:
JavaScript
1
8
1
A B col
2
0 3 9 Apple
3
1 8 8 Apple
4
2 7 9 Mango
5
3 2 4 Apple
6
4 4 5 Mango
7
5 1 6 Apple
8
I am trying to get the values of columns – "A" and "B"
wherever there are duplicate values in column col
. For example the column col
has value Apple
in index – 0,1,3,5
and I am trying to get the respective values in column – A and B
, ie
JavaScript
1
2
1
{"Apple":[[3,9],[8,8],[2,4],[1,6]], "Mango": [[7,9],[4,5]]}
2
I have a iterative approach which takes a long time on big Dfs.
Current Approach:
-> Find Unique values in column col
JavaScript
1
2
1
unique_values = list(df["col"].unique())
2
-> Iterate through this list and a inner loop through every row of the DF to get the required OP:
JavaScript
1
12
12
1
op = {}
2
for i in range(len(unique_values)):
3
for index, rows in df.iterrows():
4
if rows ["col"] == unique_values[i]:
5
inner_op = []
6
inner_op.append(rows["A"]);inner_op.append(rows["B"])
7
if rows["col"] in op:
8
op[rows["col"]].append(inner_op)
9
else:
10
op[rows["col"]] = [inner_op]
11
12
final OP:
JavaScript
1
2
1
{'Apple': [[3, 9], [8, 8], [2, 4], [1, 6]], 'Mango': [[7, 9], [4, 5]]}
2
Is there any suggestions for a more pandas approach which could be more efficient?
Advertisement
Answer
Create column filled by list C
convert values to numpy array and to list and then aggregate list
by GroupBy.agg
with Series.to_dict
:
JavaScript
1
10
10
1
np.random.seed(2020)
2
df = pd.DataFrame(np.random.randint(0,10,size = (6,2)),columns=["A","B"])
3
df["col"] = ["Apple","Apple","Mango","Apple","Mango","Apple"]
4
5
df['C'] = df[['A','B']].to_numpy().tolist()
6
d = df.groupby('col')['C'].agg(list).to_dict()
7
print (d)
8
{'Apple': [[0, 8], [3, 6], [7, 8], [8, 9]], 'Mango': [[3, 3], [0, 0]]}
9
10