Pandas get column values based on duplicate rows

I have a sample DF:

df = pd.DataFrame(np.random.randint(0,10,size = (6,2)),columns=["A","B"])
df["col"] = ["Apple","Apple","Mango","Apple","Mango","Apple"]
df

OP:

    A   B   col
0   3   9   Apple
1   8   8   Apple
2   7   9   Mango
3   2   4   Apple
4   4   5   Mango
5   1   6   Apple

I am trying to get the values of columns – "A" and "B" wherever there are duplicate values in column col. For example the column col has value Apple in index – 0,1,3,5 and I am trying to get the respective values in column – A and B, ie

 {"Apple":[[3,9],[8,8],[2,4],[1,6]], "Mango": [[7,9],[4,5]]}

I have a iterative approach which takes a long time on big Dfs.

Current Approach:

-> Find Unique values in column col

unique_values = list(df["col"].unique())

-> Iterate through this list and a inner loop through every row of the DF to get the required OP:

op = {}
for i in range(len(unique_values)):
    for index, rows in df.iterrows():
        if rows ["col"] == unique_values[i]:
            inner_op = []
            inner_op.append(rows["A"]);inner_op.append(rows["B"])
            if rows["col"] in op:
                op[rows["col"]].append(inner_op)
            else:
                op[rows["col"]] = [inner_op]

final OP:

{'Apple': [[3, 9], [8, 8], [2, 4], [1, 6]], 'Mango': [[7, 9], [4, 5]]}

Is there any suggestions for a more pandas approach which could be more efficient?

Answer

Create column filled by list C convert values to numpy array and to list and then aggregate list by GroupBy.agg with Series.to_dict:

np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0,10,size = (6,2)),columns=["A","B"])
df["col"] = ["Apple","Apple","Mango","Apple","Mango","Apple"]

df['C'] = df[['A','B']].to_numpy().tolist()
d = df.groupby('col')['C'].agg(list).to_dict()
print (d)
{'Apple': [[0, 8], [3, 6], [7, 8], [8, 9]], 'Mango': [[3, 3], [0, 0]]}

Advertisement

Answer