I’m running the function get_content in parallel with multiprocess.Pool. It raises the error NameError: name 'session' is not defined, even though I clearly defined it with session = requests.Session(). Could you please elaborate on this issue?
import requests, os
from bs4 import BeautifulSoup
from multiprocess import Pool, freeze_support

core = os.cpu_count()
session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
links = ['https://www.investopedia.com/terms/1/0x-protocol.asp',
         'https://www.investopedia.com/terms/1/1-10net30.asp',
         'https://www.investopedia.com/terms/1/10-k.asp',
         'https://www.investopedia.com/terms/1/10k-wrap.asp',
         'https://www.investopedia.com/terms/1/10q.asp']

############ Get content of a word

def get_content(l):
    r = session.get(l, headers = headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    entry_name = soup.select_one('#article-heading_3-0').contents[0]
    main = soup.select('.comp.article-body.mntl-block')[0]
    content = entry_name + '\n' + '<link href="investopedia.css" rel="stylesheet"/>' + '\n' + str(main) + '\n</>\n'
    return(content)

############ Parallel computing

if __name__=="__main__":
    P = Pool(processes = core)
    content_list = P.map(get_content, links)
    content_all = ''.join(content_list)
    freeze_support()
Answer
First of all, your import statement is incorrect and should be:
from multiprocessing import Pool, freeze_support
(You had from multiprocess ..., which is a separate third-party package, so I am not sure how it ran at all unless you have that installed.)
With the correct import statement, the code runs for me, but it does not do what you think it does! I surmise from the call to freeze_support that you are running under Windows. On that platform new processes are created with the spawn method, which causes each new process to re-execute the program from the very top. This is why the code that creates the new processes must be inside a block governed by if __name__ == '__main__':. If it weren’t, your newly created processes would re-execute the code that just created them, spawning new processes in a never-ending recursive loop.
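To see this behavior in isolation, here is a minimal sketch (not part of your program) that you can run under Windows: the module-level print executes once in the parent and once more in every spawned worker, because each worker re-imports the module from the top.

from multiprocessing import Pool
import os

# Runs in the parent process and again in every spawned worker process.
print(f'module-level code executed in process {os.getpid()}')

def square(x):
    return x * x

if __name__ == '__main__':
    # Without this guard, the Pool would be re-created in each worker,
    # recursively spawning new processes forever.
    with Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))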
Because of this re-execution, each process creates its own Session instance, since the following statement is at global scope:
session = requests.Session()
So you get no real benefit from being able to reuse the same Session instance across the multiple URLs you are retrieving. To reuse a single Session instance, you must initialize the multiprocessing pool itself with the session object, via a pool initializer function, so that each worker process gets a reference to it. You should also keep only the minimal executable code at global scope:
import requests, os
from bs4 import BeautifulSoup
from multiprocessing import Pool, freeze_support

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}

def init_pool(s):
    global session
    session = s

############ Get content of a word

def get_content(l):
    r = session.get(l, headers = headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    entry_name = soup.select_one('#article-heading_3-0').contents[0]
    main = soup.select('.comp.article-body.mntl-block')[0]
    content = entry_name + '\n' + '<link href="investopedia.css" rel="stylesheet"/>' + '\n' + str(main) + '\n</>\n'
    return(content)

############ Parallel computing

if __name__=="__main__":
    core = os.cpu_count()
    session = requests.Session()
    links = ['https://www.investopedia.com/terms/1/0x-protocol.asp',
             'https://www.investopedia.com/terms/1/1-10net30.asp',
             'https://www.investopedia.com/terms/1/10-k.asp',
             'https://www.investopedia.com/terms/1/10k-wrap.asp',
             'https://www.investopedia.com/terms/1/10q.asp']
    p = Pool(processes = core, initializer=init_pool, initargs=(session,))
    content_list = p.map(get_content, links)
    content_all = ''.join(content_list)
    print(content_all)
    freeze_support()
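If it helps to see the initializer pattern on its own, here is a stripped-down sketch (the names are illustrative, not from your program): the parent passes one object through initargs, and each worker stores its own copy in a global before it processes any tasks.

from multiprocessing import Pool

def init_worker(shared_obj):
    # Runs once in each worker process; stores the object in a global
    # that the worker's task function can reuse for every task it handles.
    global obj
    obj = shared_obj

def work(x):
    return obj * x

if __name__ == '__main__':
    with Pool(processes=2, initializer=init_worker, initargs=(10,)) as pool:
        print(pool.map(work, [1, 2, 3]))    # [10, 20, 30]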
But in fact your code mostly spends its time waiting for the URLs to be retrieved and only a little CPU time processing the returned HTML. This makes it a good candidate for multithreading instead of multiprocessing. The only changes you need to make to your original code to use multithreading are (1) removing all references to freeze_support (which you did not need for multiprocessing anyway unless you were planning on creating an exe file) and (2) changing one import statement:
from multiprocessing.dummy import Pool
Additionally, you should not limit the number of threads to the number of CPU cores you have (although there is some maximum you will not want to exceed):
import requests, os
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool

session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
links = ['https://www.investopedia.com/terms/1/0x-protocol.asp',
         'https://www.investopedia.com/terms/1/1-10net30.asp',
         'https://www.investopedia.com/terms/1/10-k.asp',
         'https://www.investopedia.com/terms/1/10k-wrap.asp',
         'https://www.investopedia.com/terms/1/10q.asp']

############ Get content of a word

def get_content(l):
    r = session.get(l, headers = headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    entry_name = soup.select_one('#article-heading_3-0').contents[0]
    main = soup.select('.comp.article-body.mntl-block')[0]
    content = entry_name + '\n' + '<link href="investopedia.css" rel="stylesheet"/>' + '\n' + str(main) + '\n</>\n'
    return(content)

############ Concurrent computing

if __name__=="__main__":
    # max of 25 is arbitrary; we do not want to appear to be a denial of service attack
    P = Pool(processes = min(len(links), 25))
    content_list = P.map(get_content, links)
    content_all = ''.join(content_list)
    print(content_all)
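For what it is worth, the standard library's concurrent.futures module gives you the same thread-pool behavior with a slightly different API. This sketch reuses get_content and links from the block above; it is just an alternative spelling, not a change in approach.

from concurrent.futures import ThreadPoolExecutor

if __name__ == "__main__":
    # Same arbitrary cap of 25 threads so we do not appear to be a denial of service attack.
    with ThreadPoolExecutor(max_workers=min(len(links), 25)) as executor:
        content_list = list(executor.map(get_content, links))
    content_all = ''.join(content_list)
    print(content_all)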
And, finally, you can combine both a thread pool and a multiprocessing pool, using the latter to take care of the CPU-intensive portion of the processing:
import requests, os
from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool
from multiprocessing.pool import Pool
from functools import partial

session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
links = ['https://www.investopedia.com/terms/1/0x-protocol.asp',
         'https://www.investopedia.com/terms/1/1-10net30.asp',
         'https://www.investopedia.com/terms/1/10-k.asp',
         'https://www.investopedia.com/terms/1/10k-wrap.asp',
         'https://www.investopedia.com/terms/1/10q.asp']

############ Get content of a word

def get_content(process_pool, l):
    r = session.get(l, headers = headers)
    return process_pool.apply(process_content, args=(r.content,))

def process_content(content):
    soup = BeautifulSoup(content, 'html.parser')
    entry_name = soup.select_one('#article-heading_3-0').contents[0]
    main = soup.select('.comp.article-body.mntl-block')[0]
    content = entry_name + '\n' + '<link href="investopedia.css" rel="stylesheet"/>' + '\n' + str(main) + '\n</>\n'
    return(content)

############ Parallel computing

if __name__=="__main__":
    process_pool = Pool(processes = min(len(links), os.cpu_count()))
    thread_pool = ThreadPool(processes = min(len(links), 25))
    content_list = thread_pool.map(partial(get_content, process_pool), links)
    content_all = ''.join(content_list)
    print(content_all)
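Note how this last version divides the work: each thread downloads a page and then calls process_pool.apply, which blocks that thread until a worker process has finished parsing the HTML it handed over. The network waits still overlap across the threads, while the CPU-heavy BeautifulSoup work runs in separate processes and so is not serialized by the GIL.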