I have a sample DF:
df = pd.DataFrame(np.random.randint(0,10,size = (6,2)),columns=["A","B"]) df["col"] = ["Apple","Apple","Mango","Apple","Mango","Apple"] df
OP:
A B col 0 3 9 Apple 1 8 8 Apple 2 7 9 Mango 3 2 4 Apple 4 4 5 Mango 5 1 6 Apple
I am trying to get the values of columns – "A" and "B"
wherever there are duplicate values in column col
. For example the column col
has value Apple
in index – 0,1,3,5
and I am trying to get the respective values in column – A and B
, ie
{"Apple":[[3,9],[8,8],[2,4],[1,6]], "Mango": [[7,9],[4,5]]}
I have a iterative approach which takes a long time on big Dfs.
Current Approach:
-> Find Unique values in column col
unique_values = list(df["col"].unique())
-> Iterate through this list and a inner loop through every row of the DF to get the required OP:
op = {} for i in range(len(unique_values)): for index, rows in df.iterrows(): if rows ["col"] == unique_values[i]: inner_op = [] inner_op.append(rows["A"]);inner_op.append(rows["B"]) if rows["col"] in op: op[rows["col"]].append(inner_op) else: op[rows["col"]] = [inner_op]
final OP:
{'Apple': [[3, 9], [8, 8], [2, 4], [1, 6]], 'Mango': [[7, 9], [4, 5]]}
Is there any suggestions for a more pandas approach which could be more efficient?
Advertisement
Answer
Create column filled by list C
convert values to numpy array and to list and then aggregate list
by GroupBy.agg
with Series.to_dict
:
np.random.seed(2020) df = pd.DataFrame(np.random.randint(0,10,size = (6,2)),columns=["A","B"]) df["col"] = ["Apple","Apple","Mango","Apple","Mango","Apple"] df['C'] = df[['A','B']].to_numpy().tolist() d = df.groupby('col')['C'].agg(list).to_dict() print (d) {'Apple': [[0, 8], [3, 6], [7, 8], [8, 9]], 'Mango': [[3, 3], [0, 0]]}