
How to cut my addiction to Python dictionaries

So, I have a large 30k line program I’ve been writing for a year. It basically gathers non-normalized and non-standardized data from multiple sources and matches everything up after standardizing the sources.

I’ve written almost everything with ordered dictionaries. This allowed me to keep the columns ordered, named and mutable, which made processing easier, as values can be assigned/fixed throughout the entire mess of code.

However, I’m currently running out of RAM from all these dictionaries. I’ve since learned that switching to namedtuples would reduce the memory footprint; the only problem is that namedtuples aren’t mutable, which raises one issue in doing the conversion.

I believe I could use a class to eliminate the immutability, but will my RAM savings be the same? Another option would be to use namedtuples and reassign them to new namedtuples every time a value needs to change (i.e. NewTup = Tup(oldTup.obj1, oldTup.obj2, “something new”)). But I think I’d need an explicit way to destroy the old one afterward or space could become an issue again.
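For what it’s worth, the reassignment idea above is roughly what namedtuple already supports through its _replace method, and no explicit destroy step is needed (the field names below are made up for illustration):

```python
from collections import namedtuple

# Hypothetical record type, just for illustration
Tup = namedtuple("Tup", ["obj1", "obj2", "obj3"])

old_tup = Tup("a", "b", "something old")
new_tup = old_tup._replace(obj3="something new")  # builds a new tuple, copying the other fields

# Rebinding the name drops the last reference to the old tuple, and
# CPython frees it immediately via reference counting.
old_tup = new_tup
```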

The bottom line is that my input files are about 6 GB on disk (lots of data), and I’m forced to process this data on a server with 16 GB RAM and 4 GB swap. I originally loaded all the rows of these various I/O data sets into dictionaries, which is eating too much RAM. But the mutable nature and named referencing were a huge help in faster development. How do I cut my addiction to dictionaries so that I can get the memory savings of other objects without rewriting the entire application due to the immutable nature of tuples?

SAMPLE CODE:

for tan_programs_row in f_tan_programs:
    #stats not included due to urgent need
    tan_id = tan_programs_row["Computer ID"].strip() #The Tanium ID by which to reference all other tanium files (i.e. primary key)
    if("NO RESULT" not in tan_id.upper()):
        tan_programs_name = tan_programs_row["Name"].strip() #The Program Name
        tan_programs_publisher = tan_programs_row["Publisher"].strip() #The Program Vendor
        tan_programs_version = tan_programs_row["Version"].strip() #The Program Version

        try:
            unnorm_tan_dict[tan_id] #test the key, if non-existent go to exception
        except KeyError:
            #form the item since it doesn't exist yet
            unnorm_tan_dict[tan_id] = {
                "Tanium ID": tan_id,
                "Computer Name": "INDETERMINATE",
                "Operating System": "INDETERMINATE",
                "Operating System Build Number": "INDETERMINATE",
                "Service Pack": "INDETERMINATE",
                "Country Code": "INDETERMINATE",
                "Manufacturer": "INDETERMINATE",
                "Model": "INDETERMINATE",
                "Serial": "INDETERMINATE"
            }
        unnorm_tan_prog_list.append(rows.TanRawProg._make([tan_id, tan_programs_name, tan_programs_publisher, tan_programs_version]))

for tan_processes_row in f_tan_processes:
    #stats not included due to urgent need
    tan_id = tan_processes_row["Computer ID"].strip() #The Tanium ID by which to reference all other tanium files (i.e. primary key)
    if("NO RESULT" not in tan_id.upper()):
        tan_process_name = tan_processes_row["Running Processes"].strip() #The Process Name
        try:
            unnorm_tan_dict[tan_id] #test the key, if non-existent go to exception
        except KeyError:
            #form the item since it doesn't exist yet
            unnorm_tan_dict[tan_id] = {
                "Tanium ID": tan_id,
                "Computer Name": "INDETERMINATE",
                "Operating System": "INDETERMINATE",
                "Operating System Build Number": "INDETERMINATE",
                "Service Pack": "INDETERMINATE",
                "Country Code": "INDETERMINATE",
                "Manufacturer": "INDETERMINATE",
                "Model": "INDETERMINATE",
                "Serial": "INDETERMINATE"
            }
        unnorm_tan_proc_list.append(rows.TanRawProc._make([tan_id, tan_process_name]))

*Later on, these values are often changed by bringing in other data sets.


Answer

Just write your own class, and use __slots__ to keep the memory footprint to a minimum:

class UnnormTan(object):
    __slots__ = ('tan_id', 'computer_name', ...)
    def __init__(self, tan_id, computer_name="INDETERMINATE", ...):
        self.tan_id = tan_id
        self.computer_name = computer_name
        # ...

This can get a little verbose perhaps, and if you need to use these as dictionary keys you’ll have more typing to do.
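As a rough way to see the savings, you can compare a slotted instance with a plain one: a slotted instance has no per-instance __dict__, which is where most of the overhead lives. The two-field classes below are a sketch, and exact sizes vary by Python version:

```python
import sys

class Slotted:
    __slots__ = ('tan_id', 'computer_name')
    def __init__(self, tan_id, computer_name="INDETERMINATE"):
        self.tan_id = tan_id
        self.computer_name = computer_name

class Plain:
    def __init__(self, tan_id, computer_name="INDETERMINATE"):
        self.tan_id = tan_id
        self.computer_name = computer_name

s = Slotted("123")
p = Plain("123")

# The slotted instance has no __dict__ at all; the plain one carries
# a separate dict object on top of the instance itself.
print(sys.getsizeof(s))                               # slotted instance
print(sys.getsizeof(p) + sys.getsizeof(p.__dict__))   # plain instance + its dict
```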

There is a project that makes creating such classes easier: attrs:

from attr import attrs, attrib

@attrs(slots=True)
class UnnormTan(object):
    tan_id = attrib()
    computer_name = attrib(default="INDETERMINATE")
    # ...

Classes created with the attrs library automatically take care of proper equality testing, representation and hashability.
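For example, with the classic attr API (the class below is a sketch with invented field values):

```python
from attr import attrs, attrib

@attrs(slots=True)
class UnnormTan:
    tan_id = attrib()
    computer_name = attrib(default="INDETERMINATE")

a = UnnormTan("123")
b = UnnormTan("123")
print(a)       # generated repr: UnnormTan(tan_id='123', computer_name='INDETERMINATE')
print(a == b)  # True: field-by-field equality is generated for you
a.computer_name = "host-1"  # slots=True still allows mutation, just no new attributes
```

Newer releases of the library also offer attrs.define/attrs.field as the modern spelling of the same idea.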

Such objects are the most memory-efficient representation of data Python can offer. If that is not enough (and it could well be that it is not), you need to look at offloading your data to disk. The easiest way to do that is to use a SQL database, such as the bundled sqlite3 module (SQLite). Even if you use a temporary :memory: database, the database will manage your memory load by swapping out pages to disk as needed.
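A minimal sketch of that approach with the bundled sqlite3 module (the table and column names are invented to mirror the question’s data):

```python
import sqlite3

# ":memory:" keeps the database in RAM for this sketch; pointing connect()
# at a file path instead moves the data out of RAM and onto disk.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS unnorm_tan (
        tan_id TEXT PRIMARY KEY,
        computer_name TEXT DEFAULT 'INDETERMINATE',
        operating_system TEXT DEFAULT 'INDETERMINATE'
    )
""")

# INSERT OR IGNORE plays the role of the try/except KeyError pattern in the
# question: the row is created only if the primary key is not present yet.
conn.execute("INSERT OR IGNORE INTO unnorm_tan (tan_id) VALUES (?)", ("123",))

# Rows stay mutable: later data sets can UPDATE individual columns.
conn.execute("UPDATE unnorm_tan SET computer_name = ? WHERE tan_id = ?",
             ("host-1", "123"))
conn.commit()

row = conn.execute(
    "SELECT computer_name, operating_system FROM unnorm_tan WHERE tan_id = ?",
    ("123",),
).fetchone()
print(row)  # ('host-1', 'INDETERMINATE')
```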
