
How does numpy avoid copy on access on child process from gc reference counting

On POSIX systems, after you fork(), data should only be copied to the child process once it is written to (copy-on-write). But because CPython stores the reference count in the object header, merely iterating a list in the child process writes to each element's header, so those pages get copied into the child's memory.

Testing this with lists and other data structures, I can confirm that behavior; there is also corroboration from a core developer: https://github.com/python/cpython/pull/3705#issuecomment-420201071
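As a minimal sketch of why plain lists trigger the copy (assuming CPython, where `sys.getrefcount` reports the count stored in the object header, plus one for its own temporary argument reference): merely binding a name to a list element writes to that element's header:

```python
import sys

items = [object() for _ in range(3)]

before = sys.getrefcount(items[0])  # includes getrefcount's own temporary reference
x = next(iter(items))               # binding x writes to items[0]'s refcount field
after = sys.getrefcount(items[0])

assert after == before + 1  # the element's header page was written to
```

After fork(), that same header write is what forces the kernel to copy the page holding the element into the child.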

But after testing this with numpy arrays, I found that this copying does not happen.

import ctypes
import os

import numpy as np
import psutil


def sharing_with_numpy():
    ppid = os.getpid()
    print(f'\nSystem used memory: {int(psutil.virtual_memory().used / (1024 * 1024))} MB')
    big_data = np.array([[item, item] for item in list(range(10000000))])
    print(f'\nSystem used memory: {int(psutil.virtual_memory().used / (1024 * 1024))} MB')
    print(ctypes.c_long.from_address(id(big_data)).value)  # read the refcount from the object header (CPython)
    ref1 = big_data[0]
    ref2 = big_data[0]
    print(ctypes.c_long.from_address(id(big_data)).value)

    print(f'\nSystem used memory: {int(psutil.virtual_memory().used / (1024 * 1024))} MB')
    for i in range(5):
        if ppid == os.getpid():  # only the original parent forks, creating 5 children
            os.fork()
    for x in big_data:  # runs in parent and children; reads the (still shared) buffer
        pass
    print(f'\nSystem used memory: {int(psutil.virtual_memory().used / (1024 * 1024))} MB')


if __name__ == "__main__":
    sharing_with_numpy()

Output:

System used memory: 163 MB # before array allocation
System used memory: 318 MB # after array allocation
1 # reference count of the array
3 # reference count of the array
System used memory: 318 MB # before fork()
System used memory: 324 MB # after fork() and loop to reference array
System used memory: 328 MB # after fork() and loop to reference array
System used memory: 329 MB # after fork() and loop to reference array
System used memory: 331 MB # after fork() and loop to reference array
System used memory: 340 MB # after fork() and loop to reference array
System used memory: 342 MB # after fork() and loop to reference array

As you can see, memory grows, but only slightly, indicating that the entire array was not copied.

I’ve been trying to understand what is happening, without luck. Could you explain? Thank you.


Answer

numpy arrays have an object header that contains a pointer to the underlying data, which is allocated separately. The data itself contains no reference counts, so it is not modified merely by reading it.
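One way to see the header/data split (a sketch; `id()` returning the object's address is a CPython implementation detail, and `__array_interface__` is NumPy's public description of the buffer):

```python
import numpy as np

arr = np.arange(1_000_000)

header_addr = id(arr)                           # address of the PyObject header (CPython)
data_addr = arr.__array_interface__['data'][0]  # address of the separately allocated buffer

assert header_addr != data_addr  # the refcount lives in the header, away from the data pages
```

Only the small header page can be dirtied by reference counting; the megabytes of buffer behind `data_addr` stay untouched.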

Because the backing data stores are large, they are block allocated in bulk and don’t come from the small object pool the object headers come from: they are typically allocated straight from the OS (via mmap [*NIX] or VirtualAlloc [Windows]) rather than carved out of a heap subdivided among many allocations. And since those pages hold only raw C values, not reference-counted Python objects like Python ints with their own object headers, nothing ever writes to them, so they never get copied.
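A sketch of the consequence, reusing the question’s refcount-peeking trick (CPython-specific): operations that walk the raw buffer leave the array’s refcount, and hence its data pages, untouched:

```python
import ctypes
import numpy as np

arr = np.arange(1_000_000, dtype=np.int64)

rc_before = ctypes.c_long.from_address(id(arr)).value
total = int(arr.sum())  # sum() walks the raw C buffer; no per-element Python objects survive the call
rc_after = ctypes.c_long.from_address(id(arr)).value

assert rc_before == rc_after  # no lasting write to the header, and none at all to the buffer
```

In a forked child this is exactly why the buffer pages stay shared: reading them never triggers copy-on-write.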

User contributions licensed under: CC BY-SA