Combine Pool.map with shared memory Array in Python multiprocessing

Question

I have a very large (read only) array of data that I want to be processed by multiple processes in parallel. I like the Pool.map function and would like to use it to calculate functions on that data in parallel. I saw that one can use the Value or Array class to use shared memory data between processes. But when

Accepted Answer

Trying again as I just saw the bounty ;)

Basically I think the error message means what it said: multiprocessing shared memory Arrays can't be passed as arguments (by pickling). It doesn't make sense to serialise the data; the point is that the data is in shared memory. So you have to make the shared array global. I think it's neater to put it as the attribute of a module, as in my first answer, but just leaving it as a global variable in your example also works well. Taking on board your point of not wanting to set the data before the fork, here is a modified example. If you wanted to have more than one possible shared array (and that's why you wanted to pass toShare as an argument) you could similarly make a global list of shared arrays, and just pass the index to count_it (whose loop would become for c in toShare[i]:).

    from multiprocessing import Pool, Array

    def count_it(key):
        count = 0
        for c in toShare:
            if c == key:
                count += 1
        return count

    if __name__ == '__main__':
        # allocate shared array - want lock=False in this case since we
        # aren't writing to it and want to allow multiple processes to access
        # at the same time - I think with lock=True there would be little or
        # no speedup
        maxLength = 50
        toShare = Array('c', maxLength, lock=False)

        # fork
        pool = Pool()

        # can set data after fork
        # (bytes literals, since a 'c' array holds single bytes)
        testData = b"abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf"
        if len(testData) > maxLength:
            raise ValueError("Shared array too small to hold data")
        toShare[:len(testData)] = testData

        print(pool.map(count_it, [b"a", b"b", b"s", b"d"]))

EDIT: The above doesn't work on Windows because it doesn't use fork.
However, the below does work on Windows, still using Pool, so I think this is the closest to what you want:

    from multiprocessing import Pool, Array
    import mymodule

    def count_it(key):
        count = 0
        for c in mymodule.toShare:
            if c == key:
                count += 1
        return count

    def initProcess(share):
        # runs once in each worker: store the shared array on the module
        mymodule.toShare = share

    if __name__ == '__main__':
        # allocate shared array - want lock=False in this case since we
        # aren't writing to it and want to allow multiple processes to access
        # at the same time - I think with lock=True there would be little or
        # no speedup
        maxLength = 50
        toShare = Array('c', maxLength, lock=False)

        # fork
        pool = Pool(initializer=initProcess, initargs=(toShare,))

        # can set data after fork
        # (bytes literals, since a 'c' array holds single bytes)
        testData = b"abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf"
        if len(testData) > maxLength:
            raise ValueError("Shared array too small to hold data")
        toShare[:len(testData)] = testData

        print(pool.map(count_it, [b"a", b"b", b"s", b"d"]))

Not sure why map won't pickle the array but Process and Pool will. I think perhaps it has to be transferred at the point of subprocess initialization on Windows. Note that the data is still set after the fork though.
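For reference, here is how the initializer pattern might look in Python 3, combined with the "list of shared arrays plus an index" idea mentioned earlier in this answer. Treat it as a minimal sketch rather than the original code: the names init_worker, count_keys, _shared_arrays and the start_method parameter are invented for illustration, and unlike the examples in this answer it fills the arrays before the pool is created. It assumes the data is bytes, because a 'c' array stores single bytes in Python 3.

```python
import multiprocessing as mp

# Hypothetical module-level slot; every worker fills it in via init_worker.
_shared_arrays = None


def init_worker(arrays):
    """Runs once in each worker process and stores the shared arrays."""
    global _shared_arrays
    _shared_arrays = arrays


def count_it(task):
    """Count occurrences of a one-byte key in the shared array at `index`."""
    index, key = task
    # Iterating a 'c' array yields one-byte bytes objects in Python 3.
    return sum(1 for c in _shared_arrays[index] if c == key)


def count_keys(datasets, tasks, start_method=None):
    """Copy each bytes object into shared memory, then map tasks over a pool."""
    ctx = mp.get_context(start_method)  # None selects the platform default
    # lock=False gives raw shared arrays, which is fine for read-only workers.
    arrays = [ctx.Array('c', data, lock=False) for data in datasets]
    # The arrays reach the workers through initargs, not through map().
    with ctx.Pool(initializer=init_worker, initargs=(arrays,)) as pool:
        return pool.map(count_it, tasks)


if __name__ == '__main__':
    test_data = b"abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf"
    other_data = b"a second shared array"
    tasks = [(0, b"a"), (0, b"b"), (0, b"s"), (1, b"a")]
    print(count_keys([test_data, other_data], tasks))
```

Only the small (index, key) tuples are pickled for each task; the arrays themselves are handed over once per worker when the pool starts, which is the same route initProcess uses above.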

Answer