
Packages that are imported are not recognized during parallel computing?

I’m running the function get_content in a parallel setting with multiprocess.Pool. It throws the error NameError: name 'session' is not defined, even though I clearly defined it with session = requests.Session(). Could you please elaborate on this issue?



Answer

First of all, your import statement is incorrect and should be:


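The code for this step did not survive on this page. Assuming the standard-library module was intended, the corrected import would presumably look like the line below; the exact names imported are an assumption based on the Pool and freeze_support calls mentioned in this thread. (A separate third-party package named multiprocess also exists, so the original spelling would only import if that package happened to be installed.)

```python
# The standard-library module is spelled "multiprocessing".
# The imported names are assumed from the calls discussed in this thread.
from multiprocessing import Pool, freeze_support
```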
(You had from multiprocess ..., so I am not sure how it ran at all)

With the correct import statement, the code runs for me, but it does not do what you think it does! I surmise from the call to freeze_support that you are running under Windows. On that platform new processes are created with the spawn method, which starts a fresh Python interpreter that re-imports your main module, so everything at global scope executes again from the very top. This is why the code that creates the new processes must be inside a block governed by if __name__ == '__main__':. If it weren’t, each newly created process would re-execute the code that just created it, recursively trying to spawn yet more processes.

This means that each process creates its own Session instance, because the statement session = requests.Session() sits at global scope and is therefore re-executed in every new process.

So you are getting no real benefit from being able to reuse one Session instance for the multiple URLs you are retrieving. To set the session up deliberately, initialize the multiprocessing pool itself with the session object through the pool's initializer, so that each worker process stores it once as a global and reuses it for every URL it handles. Note that the processes do not literally share one object: each worker holds its own copy. You should also keep only the minimal executable code at global scope:


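The original code for this step is not shown here, so the following is only a minimal sketch of the pool-initializer pattern. StubSession is a hypothetical stand-in for requests.Session (so the sketch runs without network access), and init_worker, main, and the example URLs are likewise assumptions made for illustration.

```python
from multiprocessing import Pool


class StubSession:
    """Hypothetical stand-in for requests.Session; performs no network I/O."""

    def get(self, url):
        return f"<html>content of {url}</html>"


def init_worker(the_session):
    # Runs once in each worker process when the pool starts it.
    # The worker keeps its own copy of the session as a global.
    global session
    session = the_session


def get_content(url):
    # Reuses the session installed by init_worker in this process.
    return session.get(url)


def main():
    urls = ["https://example.com/a", "https://example.com/b"]
    # Only the pool creation lives under the __main__ guard (via main()).
    with Pool(processes=2, initializer=init_worker,
              initargs=(StubSession(),)) as pool:
        return pool.map(get_content, urls)


if __name__ == "__main__":
    print(main())
```

With a real requests.Session in place of the stub, each worker would reuse its own session for all the URLs assigned to it, rather than building one as a side effect of the module being re-imported.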
But in fact your code spends most of its time waiting for the URLs to be retrieved and only a little CPU time processing the returned HTML. This makes it a good candidate for multithreading instead of multiprocessing. The only changes needed to your original code to use multithreading are to (1) get rid of all references to freeze_support (which you did not need for multiprocessing unless you were planning on creating an exe file) and (2) change one import statement:


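The changed import is not preserved on this page. One way it could look, assuming the rest of the code uses the name Pool, is to alias the standard library's ThreadPool to that name. The get_content body, the StubSession class, and fetch_all below are hypothetical stand-ins so the sketch runs without network access.

```python
import time
from multiprocessing.pool import ThreadPool as Pool  # the one changed import


class StubSession:
    """Hypothetical stand-in for requests.Session; performs no network I/O."""

    def get(self, url):
        time.sleep(0.05)  # simulate waiting on the network
        return f"<html>{url}</html>"


# Threads run in one process, so they all see this single global object.
session = StubSession()


def get_content(url):
    return session.get(url)


def fetch_all(urls):
    # Same Pool API as before, but the workers are threads, not processes.
    with Pool(4) as pool:
        return pool.map(get_content, urls)
```

Because no new interpreter is started, nothing is re-imported and no arguments need to be pickled.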
Additionally, you should not be limited by the number of CPU cores you have in determining the number of threads to use (although there is some maximum you will not want to go over):


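As a rough illustration of sizing the thread pool by the amount of work instead of the core count, a sketch might look like the following. MAX_THREADS, pool_size, and fetch_all are hypothetical names, and the cap of 100 is an arbitrary assumption, not a recommendation from the original answer.

```python
from multiprocessing.pool import ThreadPool

MAX_THREADS = 100  # hypothetical upper bound; tune for your workload


def pool_size(n_urls, max_threads=MAX_THREADS):
    # One thread per URL, capped at max_threads, and never fewer than one.
    return max(1, min(n_urls, max_threads))


def fetch_all(urls, fetch):
    # fetch is whatever callable retrieves a single URL.
    with ThreadPool(pool_size(len(urls))) as pool:
        return pool.map(fetch, urls)
```

Usage would be along the lines of fetch_all(urls, get_content), where get_content is the function from the question.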
And, finally, you can combine a thread pool and a multiprocessing pool, using the latter to take care of the CPU-intensive portion of the processing:

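The combined version is also missing from this page. A minimal sketch of the idea, under the assumption that each thread fetches a page and then hands the CPU-heavy step to a process pool, could look like this. fetch and parse are hypothetical stand-ins for the network call and the HTML processing, and the pool sizes are arbitrary.

```python
from functools import partial
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool


def fetch(url):
    # Hypothetical stand-in for the I/O-bound network request.
    return f"<html>{url}</html>"


def parse(html):
    # Hypothetical stand-in for the CPU-intensive processing step.
    return len(html)


def get_content(process_pool, url):
    html = fetch(url)                          # runs in a worker thread
    return process_pool.apply(parse, (html,))  # runs in a worker process


def main():
    urls = ["https://example.com/a", "https://example.com/b"]
    with Pool(2) as process_pool, ThreadPool(8) as thread_pool:
        # The process pool is handed to each thread; threads share memory,
        # so the pool object itself is never pickled.
        worker = partial(get_content, process_pool)
        return thread_pool.map(worker, urls)


if __name__ == "__main__":
    print(main())
```

The thread pool can be large because its threads mostly wait, while the process pool stays close to the number of cores because its workers actually compute.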
User contributions licensed under: CC BY-SA