
Summing duplicate rows

I have a database with more than 300 duplicate rows that look like this:

rate    shipment_id original_cost
8.14    500410339210    5.93
7.81    500410339221    5.93
8.53    500410339232    7.07
8.53    500410339243    14.31
2.76    500410345319    68.87
...     ...             ...
8.46    987506030619    7.36
8.46    987506030620    7.36
7.32    987506030630    6.80
27.82   997311250164181 144.44
7.32    997355250064942 19.83

For each duplicate shipment_id, I want only the original_cost values to be summed, while the rate values remain as they are. For example, given these duplicates:

rate    shipment_id original_cost
3.06    926401748430    2.54
3.06    926401748430    19.60
16.34   926401748430    2.54
16.34   926401748430    19.60

the result should look something like this:

rate    shipment_id original_cost
3.06    926401748430    22.14
3.06    926401748430    22.14
16.34   926401748430    22.14
16.34   926401748430    22.14

Is there any way to do this?


Answer

Group by the duplicated columns (['shipment_id', 'rate']) and use transform on the original_cost column to calculate the sum. Unlike a plain aggregation, transform returns a result aligned with the original index, so every row in a group receives that group's sum and the number of rows is preserved:

df['original_cost'] = df.groupby(['shipment_id', 'rate'])['original_cost'].transform('sum')

Example input:

    rate    shipment_id original_cost
0   3.06    926401748430    2.54
1   3.06    926401748430    19.60
2   16.34   926401748430    2.54
3   16.34   926401748430    19.60

Example output:

    rate   shipment_id  original_cost
0   3.06  926401748430          22.14
1   3.06  926401748430          22.14
2  16.34  926401748430          22.14
3  16.34  926401748430          22.14
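
For reference, here is a minimal runnable sketch of the whole step. The DataFrame is built from the sample values in the question; the real data would simply have more rows:

import pandas as pd

# Small DataFrame using the question's sample values
df = pd.DataFrame({
    'rate': [3.06, 3.06, 16.34, 16.34],
    'shipment_id': [926401748430] * 4,
    'original_cost': [2.54, 19.60, 2.54, 19.60],
})

# transform('sum') computes the sum per (shipment_id, rate) group and
# broadcasts it back to every row of that group, so the row count stays the same
df['original_cost'] = df.groupby(['shipment_id', 'rate'])['original_cost'].transform('sum')

print(df)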