Python. Merge dict rows with SUM

I have a lot of dict rows, more than 10 million, like this:

{'value_01': '123', 'value_02': '456', 'datacenter': '1', 'bytes': '25'}
{'value_01': '123', 'value_02': '456', 'datacenter': '1', 'bytes': '35'}
{'value_01': '678', 'value_02': '901', 'datacenter': '2', 'bytes': '55'}
{'value_01': '678', 'value_02': '456', 'datacenter': '2', 'bytes': '15'}

Is it possible to merge rows where all the other keys and values are the same into a single row, summing the 'bytes' values? I would like to minimize the number of rows and end up with something like this; it should speed up the next steps of processing.

{'value_01': '123', 'value_02': '456', 'datacenter': '1', 'bytes': '60'}
{'value_01': '678', 'value_02': '901', 'datacenter': '2', 'bytes': '55'}
{'value_01': '678', 'value_02': '456', 'datacenter': '2', 'bytes': '15'}

Thanks in advance.


Answer

Using an intermediate dictionary indexed on all "other" keys, you can accumulate the 'bytes' values in a common dictionary for each combination of the other fields, then convert the indexed values back into a list of dictionaries:

lst = [{'value_01': '123', 'value_02': '456', 'datacenter': '1', 'bytes': '25'},
       {'value_01': '123', 'value_02': '456', 'datacenter': '1', 'bytes': '35'},
       {'value_01': '678', 'value_02': '901', 'datacenter': '2', 'bytes': '55'},
       {'value_01': '678', 'value_02': '456', 'datacenter': '2', 'bytes': '15'}]

merged = dict()
for d in lst:
    k = map(d.get, sorted({*d} - {"bytes"}))  # index on all other fields
    m = merged.setdefault(tuple(k), d)        # add/get the first instance for this key
    if m is not d:                            # accumulate bytes (stored as strings)
        m['bytes'] = str(int(m['bytes']) + int(d['bytes']))
mergedList = list(merged.values())

print(mergedList)
[{'value_01': '123', 'value_02': '456', 'datacenter': '1', 'bytes': '60'},
 {'value_01': '678', 'value_02': '901', 'datacenter': '2', 'bytes': '55'},
 {'value_01': '678', 'value_02': '456', 'datacenter': '2', 'bytes': '15'}]

This works without sorting the rows (i.e. in O(n) time over the number of rows), even if your data is not grouped by the combination of the other fields. It also works if the keys appear in a different order from one row to the next. Missing keys would be problematic, but they can be taken into account by using a comprehension instead of map(d.get, ...).
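
For instance, here is a minimal sketch of that comprehension-based variant, assuming some rows may omit a field (the shortened rows below are hypothetical, not from the question). Building the key tuple from the union of all field names keeps the tuple positions fixed, with None standing in for a missing field:

lst = [{'value_01': '123', 'datacenter': '1', 'bytes': '25'},   # no value_02
       {'value_01': '123', 'datacenter': '1', 'bytes': '35'},
       {'value_01': '678', 'value_02': '901', 'datacenter': '2', 'bytes': '55'}]

fields = sorted({f for d in lst for f in d} - {"bytes"})  # union of all field names

merged = dict()
for d in lst:
    k = tuple(d.get(f) for f in fields)   # missing fields become None, positions stay fixed
    m = merged.setdefault(k, d)
    if m is not d:
        m['bytes'] = str(int(m['bytes']) + int(d['bytes']))

mergedList = list(merged.values())
print(mergedList)
# [{'value_01': '123', 'datacenter': '1', 'bytes': '60'},
#  {'value_01': '678', 'value_02': '901', 'datacenter': '2', 'bytes': '55'}]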

Note that you really should store the byte counts as integers instead of strings.
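
If the counts are already integers, the accumulation can also be phrased with collections.defaultdict. Here is a hedged sketch of that idea; the integer-valued rows and the final dict-rebuilding step are illustrative, not part of the original answer:

from collections import defaultdict

lst = [{'value_01': '123', 'value_02': '456', 'datacenter': '1', 'bytes': 25},
       {'value_01': '123', 'value_02': '456', 'datacenter': '1', 'bytes': 35},
       {'value_01': '678', 'value_02': '901', 'datacenter': '2', 'bytes': 55},
       {'value_01': '678', 'value_02': '456', 'datacenter': '2', 'bytes': 15}]

fields = sorted({*lst[0]} - {"bytes"})   # key fields, in a fixed order
totals = defaultdict(int)                # key tuple -> summed bytes
for d in lst:
    totals[tuple(map(d.get, fields))] += d['bytes']   # plain integer addition

mergedList = [dict(zip(fields, k), bytes=total) for k, total in totals.items()]
print(mergedList)
# [{'datacenter': '1', 'value_01': '123', 'value_02': '456', 'bytes': 60}, ...]

Note the rebuilt dicts list 'datacenter' first because the key fields were sorted alphabetically; key order does not affect dict equality or lookups.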
