I have a sample DF:
df = pd.DataFrame(np.random.randint(0,10,size = (6,2)),columns=["A","B"]) df["col"] = ["Apple","Apple","Mango","Apple","Mango","Apple"] df
OP:
A B col 0 3 9 Apple 1 8 8 Apple 2 7 9 Mango 3 2 4 Apple 4 4 5 Mango 5 1 6 Apple
I am trying to get the values of columns – "A" and "B" wherever there are duplicate values in column col. For example the column col has value Apple in index – 0,1,3,5 and I am trying to get the respective values in column – A and B, ie
{"Apple":[[3,9],[8,8],[2,4],[1,6]], "Mango": [[7,9],[4,5]]}
I have a iterative approach which takes a long time on big Dfs.
Current Approach:
-> Find Unique values in column col
unique_values = list(df["col"].unique())
-> Iterate through this list and a inner loop through every row of the DF to get the required OP:
op = {}
for i in range(len(unique_values)):
for index, rows in df.iterrows():
if rows ["col"] == unique_values[i]:
inner_op = []
inner_op.append(rows["A"]);inner_op.append(rows["B"])
if rows["col"] in op:
op[rows["col"]].append(inner_op)
else:
op[rows["col"]] = [inner_op]
final OP:
{'Apple': [[3, 9], [8, 8], [2, 4], [1, 6]], 'Mango': [[7, 9], [4, 5]]}
Is there any suggestions for a more pandas approach which could be more efficient?
Advertisement
Answer
Create column filled by list C convert values to numpy array and to list and then aggregate list by GroupBy.agg with Series.to_dict:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0,10,size = (6,2)),columns=["A","B"])
df["col"] = ["Apple","Apple","Mango","Apple","Mango","Apple"]
df['C'] = df[['A','B']].to_numpy().tolist()
d = df.groupby('col')['C'].agg(list).to_dict()
print (d)
{'Apple': [[0, 8], [3, 6], [7, 8], [8, 9]], 'Mango': [[3, 3], [0, 0]]}