I am trying to have a parent Python script send variables to a child script, to help me speed up and automate video analysis.
I am now using a `subprocess.Popen()` call to start up 6 instances of a child script, but I cannot find a way to pass variables and modules already imported in the parent to the child. For example, the parent file would have:

```python
import os
import subprocess
import sys

parent_dir = os.path.realpath(sys.argv[0])
subprocess.Popen([sys.executable, 'analysis.py'])
```
but then `import sys`, `import subprocess`, and `parent_dir` have to be repeated in `analysis.py`. Is there a way to pass them to the child?
In short, what I am trying to achieve is: I have a folder with a couple hundred video files. I want the parent Python script to list the video files and start up to 6 parallel instances of an analysis script, each analysing one video file. When there are no more files to be analysed, the parent script stops.
Answer
The simple answer here is: don’t use `subprocess.Popen`, use `multiprocessing.Process`. Or, better yet, `multiprocessing.Pool` or `concurrent.futures.ProcessPoolExecutor`.
With `subprocess`, your program’s Python interpreter doesn’t know anything about the subprocess at all; for all it knows, the child process is running Doom. So there’s no way to directly share information with it.* But with `multiprocessing`, Python controls launching the subprocess and getting everything set up so that you can share data as conveniently as possible.
Unfortunately “as conveniently as possible” still isn’t 100% as convenient as all being in one process. But what you can do is usually good enough. Read the section on Exchanging objects between processes and the following few sections; hopefully one of those mechanisms will be exactly what you need.
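As a taste of what those sections cover, here is a minimal sketch (not from the original answer) of one such mechanism, a `multiprocessing.Queue`; the worker function and the file path are just placeholders:

```python
import multiprocessing

def worker(q):
    # The child receives a real Python object from the queue; no manual
    # serialization or pipe handling is needed.
    path = q.get()
    print('child got', path)

if __name__ == '__main__':
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(q,))
    p.start()
    q.put('/some/video/file.mp4')  # hypothetical path, for illustration only
    p.join()
```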
But, as I implied at the top, in most cases you can make it even simpler, by using a pool. Instead of thinking about “running 6 processes and sharing data with them”, just think about it as “running a bunch of tasks on a pool of 6 processes”. A task is basically just a function—it takes arguments, and returns a value. If the work you want to parallelize fits into that model—and it sounds like your work does—life is as simple as could be. For example:
```python
import multiprocessing
import os
import sys

import analysis

parent_dir = os.path.realpath(sys.argv[0])
# folderpath is the folder of video files described in the question
paths = [os.path.join(folderpath, file) for file in os.listdir(folderpath)]
with multiprocessing.Pool(processes=6) as pool:
    results = pool.map(analysis.analyze, paths)
```
If you’re using Python 3.2 or earlier (including 2.7), you can’t use a `Pool` in a `with` statement. I believe you want this:**
```python
pool = multiprocessing.Pool(processes=6)
try:
    results = pool.map(analysis.analyze, paths)
finally:
    pool.close()
    pool.join()
```
This will start up 6 processes,*** then tell the first one to do `analysis.analyze(paths[0])`, the second to do `analysis.analyze(paths[1])`, and so on. As soon as any of the processes finishes, the pool will give it the next path to work on. When they’re all finished, you get back a list of all the results.****
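If you prefer the `concurrent.futures` API mentioned at the top, a roughly equivalent sketch (assuming the same hypothetical `analysis.analyze` function) looks like this; `executor.map` behaves much like `pool.map` here:

```python
import concurrent.futures

import analysis  # same hypothetical analysis module as above

def run(paths):
    # max_workers plays the same role as processes=6 in the Pool example.
    with concurrent.futures.ProcessPoolExecutor(max_workers=6) as executor:
        return list(executor.map(analysis.analyze, paths))
```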
Of course this means that the top-level code that lived in `analysis.py` has to be moved into a function, `def analyze(path):`, so you can call it. Or, even better, you can move that function into the main script, instead of a separate file, if you really want to save that `import` line.
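For illustration, the restructured `analysis.py` might look something like this; the body is only a stand-in (file size) for whatever your real per-file video analysis does:

```python
# analysis.py -- sketch of the restructuring described above.
import os

def analyze(path):
    # ... your actual video-analysis code goes here ...
    # Returning a value lets pool.map collect one result per video file.
    return path, os.path.getsize(path)
```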
* You can still indirectly share information by, e.g., marshaling it into some interchange format like JSON and passing it via the stdin/stdout pipes, a file, a shared memory segment, a socket, etc., but `multiprocessing` effectively wraps that up for you to make it a whole lot easier.
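For comparison, a bare-bones sketch of doing that plumbing by hand with `subprocess` and JSON over stdin (the payload and child script name are hypothetical):

```python
# Manual approach that multiprocessing saves you from: serialize the data
# yourself and pipe it to the child over stdin.
import json
import subprocess
import sys

data = {'parent_dir': '/path/to/videos'}  # hypothetical payload
proc = subprocess.Popen([sys.executable, 'analysis.py'],
                        stdin=subprocess.PIPE)
proc.communicate(json.dumps(data).encode())
# The child would then read it back with: data = json.load(sys.stdin)
```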
** There are different ways to shut a pool down, and you can also choose whether or not to join it immediately, so you really should read up on the details at some point. But when all you’re doing is calling `pool.map`, it really doesn’t matter; the pool is guaranteed to shut down and be ready to join nearly instantly by the time the `map` call returns.
*** I’m not sure why you wanted 6; most machines have 4, 8, or 16 cores, not 6, so why not use them all? The best thing to do is usually to just leave out the `processes=6` entirely and let `multiprocessing` ask your OS how many cores to use, which means it’ll still run at full speed on your new machine with twice as many cores that you’ll buy next year.
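In code, that’s just a matter of leaving the argument out; a sketch with placeholder file names standing in for your real list:

```python
import multiprocessing

import analysis  # your analysis module, as above

if __name__ == '__main__':
    paths = ['video1.mp4', 'video2.mp4']  # placeholder list of video files
    # With no processes= argument, the pool defaults to os.cpu_count() workers.
    with multiprocessing.Pool() as pool:
        results = pool.map(analysis.analyze, paths)
```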
**** This is slightly oversimplified; usually the pool will give the first process a batch of files, not one at a time, to save a bit of overhead, and you can manually control the batching if you need to optimize things or sequence them more carefully. But usually you don’t care, and this oversimplification is fine.
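Concretely, that manual control is the `chunksize` argument to `pool.map`; a sketch under the same assumptions (placeholder paths, hypothetical `analysis.analyze`):

```python
import multiprocessing

import analysis  # your analysis module, as above

if __name__ == '__main__':
    paths = ['video1.mp4', 'video2.mp4']  # placeholder list of video files
    with multiprocessing.Pool(processes=6) as pool:
        # chunksize controls how many paths each worker grabs at a time:
        # chunksize=1 hands out one file per request; larger values batch them.
        results = pool.map(analysis.analyze, paths, chunksize=1)
```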