Skip to content
Advertisement

How are small sets stored in memory?

If we look at the resize behavior for sets under 50k elements:

JavaScript

This pattern is consistent with quadrupling of the backing storage size once the set is 3/5ths full, plus some presumably constant overhead for the PySetObject:

JavaScript

A similar pattern continues even for larger sets, but the resize factor switches to doubling instead of quadrupling.

The reported size for small sets is an outlier. Instead of size 344 bytes, i.e. 16 * 8 + 216 (the storage array of a newly created empty set has 8 slots avail until the first resize up to 32 slots) only 216 bytes is reported by sys.getsizeof.

What am I missing? How are those small sets stored so that they use only 216 bytes instead of 344?

Advertisement

Answer

The set object in Python is represented by the following C structure.

JavaScript

Now remember, getsizeof() calls the object’s __sizeof__ method and adds an additional garbage collector overhead if the object is managed by the garbage collector.

Ok, set implements the __sizeof__.

JavaScript

Now let’s inspect the line

JavaScript

_PyObject_SIZE is just a macro which expands to (typeobj)->tp_basicsize.

JavaScript

This code is essentially trying to access the tp_basicsize slot to get the size in bytes of instances of the type which is just sizeof(PySetObject) in case of set.

JavaScript

I have modified the set_sizeof C function with the following changes.

JavaScript

and compiling and running these changes gives me

JavaScript

The return value is 216/728 bytes because sys.getsize add 16 bytes of GC overhead.

But the important thing to note here is this line.

JavaScript

Because for small tables(before the first resize) so->table is just a reference to fixed size(8) so->smalltable(No malloc’ed memory) so sizeof(PySetObject) is sufficient enough to get the size because it also includes the storage size( 128(16(size of setentry) * 8)).

Now what happens when the resize occurs? It constructs entirely new table (malloc’ed) and uses that table instead of so->smalltables. This means that the sets, which have resized, also carry out a dead-weight of 128 bytes (size of fixed size small table) along with the size of malloc’ed so->table.

JavaScript
Advertisement