I’m running the function get_content in parallel with multiprocess.Pool. It raises the error NameError: name 'session' is not defined, even though I clearly defined it with session = requests.Session(). Could you please elaborate on this issue?
import requests, os
from bs4 import BeautifulSoup
from multiprocess import Pool, freeze_support

core = os.cpu_count()
session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
links = ['https://www.investopedia.com/terms/1/0x-protocol.asp',
         'https://www.investopedia.com/terms/1/1-10net30.asp',
         'https://www.investopedia.com/terms/1/10-k.asp',
         'https://www.investopedia.com/terms/1/10k-wrap.asp',
         'https://www.investopedia.com/terms/1/10q.asp']

############ Get content of a word

def get_content(l):
    r = session.get(l, headers = headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    entry_name = soup.select_one('#article-heading_3-0').contents[0]
    main = soup.select('.comp.article-body.mntl-block')[0]
    content = entry_name + '\n' + '<link href="investopedia.css" rel="stylesheet"/>' + '\n' + str(main) + '\n</>\n'
    return(content)

############ Parallel computing

if __name__=="__main__":
    P = Pool(processes = core)
    content_list = P.map(get_content, links)
    content_all = ''.join(content_list)
    freeze_support()
Answer
First of all, your import statement is incorrect and should be:
from multiprocessing import Pool, freeze_support
(You had from multiprocess ..., which is a separate third-party package, so I am not sure how it ran at all unless you have that installed.)
With the correct import statement, the code runs for me, but it does not do what you think it does! I surmise from the call to freeze_support that you are running under Windows. On that platform new processes are created with the spawn method, which causes each new process to re-execute the program from the very top. This is why the code that creates the new processes must be inside a block governed by if __name__ == '__main__':. If it weren’t, your newly created processes would re-execute the code that just created them, spawning new processes in a never-ending recursive loop.
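To see this behavior in isolation, here is a minimal sketch (not part of your program) that you can run under Windows: the module-level print executes once in the parent and once more in every spawned worker, because each worker re-imports the module from the top.

from multiprocessing import Pool
import os

# Runs in the parent process and again in every spawned worker process.
print(f'module-level code executed in process {os.getpid()}')

def square(x):
    return x * x

if __name__ == '__main__':
    # Without this guard, the Pool would be re-created in each worker,
    # recursively spawning new processes forever.
    with Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))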
Because of this re-execution, each process creates its own Session instance, since the following statement is at global scope:
session = requests.Session()
So you get no real benefit from being able to reuse the same Session instance across the multiple URLs you are retrieving. To reuse a single Session instance, you must initialize the multiprocessing pool itself with the session object, via a pool initializer function, so that each worker process gets a reference to it. You should also keep only the minimal executable code at global scope:
import requests, os
from bs4 import BeautifulSoup
from multiprocessing import Pool, freeze_support

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}

def init_pool(s):
    global session
    session = s

############ Get content of a word

def get_content(l):
    r = session.get(l, headers = headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    entry_name = soup.select_one('#article-heading_3-0').contents[0]
    main = soup.select('.comp.article-body.mntl-block')[0]
    content = entry_name + '\n' + '<link href="investopedia.css" rel="stylesheet"/>' + '\n' + str(main) + '\n</>\n'
    return(content)

############ Parallel computing

if __name__=="__main__":
    core = os.cpu_count()
    session = requests.Session()
    links = ['https://www.investopedia.com/terms/1/0x-protocol.asp',
             'https://www.investopedia.com/terms/1/1-10net30.asp',
             'https://www.investopedia.com/terms/1/10-k.asp',
             'https://www.investopedia.com/terms/1/10k-wrap.asp',
             'https://www.investopedia.com/terms/1/10q.asp']
    p = Pool(processes = core, initializer=init_pool, initargs=(session,))
    content_list = p.map(get_content, links)
    content_all = ''.join(content_list)
    print(content_all)
    freeze_support()
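If it helps to see the initializer pattern on its own, here is a stripped-down sketch (the names are illustrative, not from your program): the parent passes one object through initargs, and each worker stores its own copy in a global before it processes any tasks.

from multiprocessing import Pool

def init_worker(shared_obj):
    # Runs once in each worker process; stores the object in a global
    # that the worker's task function can reuse for every task it handles.
    global obj
    obj = shared_obj

def work(x):
    return obj * x

if __name__ == '__main__':
    with Pool(processes=2, initializer=init_worker, initargs=(10,)) as pool:
        print(pool.map(work, [1, 2, 3]))    # [10, 20, 30]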
But in fact your code mostly spends its time waiting for the URLs to be retrieved and only a little CPU time processing the returned HTML. This makes it a good candidate for multithreading instead of multiprocessing. The only changes you need to make to your original code to use multithreading are (1) removing all references to freeze_support (which you did not need for multiprocessing anyway unless you were planning on creating an exe file) and (2) changing one import statement:
from multiprocessing.dummy import Pool
Additionally, you should not limit the number of threads to the number of CPU cores you have (although there is some maximum you will not want to exceed):
import requests, os
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool

session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
links = ['https://www.investopedia.com/terms/1/0x-protocol.asp',
         'https://www.investopedia.com/terms/1/1-10net30.asp',
         'https://www.investopedia.com/terms/1/10-k.asp',
         'https://www.investopedia.com/terms/1/10k-wrap.asp',
         'https://www.investopedia.com/terms/1/10q.asp']

############ Get content of a word

def get_content(l):
    r = session.get(l, headers = headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    entry_name = soup.select_one('#article-heading_3-0').contents[0]
    main = soup.select('.comp.article-body.mntl-block')[0]
    content = entry_name + '\n' + '<link href="investopedia.css" rel="stylesheet"/>' + '\n' + str(main) + '\n</>\n'
    return(content)

############ Concurrent computing

if __name__=="__main__":
    # max of 25 is arbitrary; we do not want to appear to be a denial of service attack
    P = Pool(processes = min(len(links), 25))
    content_list = P.map(get_content, links)
    content_all = ''.join(content_list)
    print(content_all)
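For what it is worth, the standard library's concurrent.futures module gives you the same thread-pool behavior with a slightly different API. This sketch reuses get_content and links from the block above; it is just an alternative spelling, not a change in approach.

from concurrent.futures import ThreadPoolExecutor

if __name__ == "__main__":
    # Same arbitrary cap of 25 threads so we do not appear to be a denial of service attack.
    with ThreadPoolExecutor(max_workers=min(len(links), 25)) as executor:
        content_list = list(executor.map(get_content, links))
    content_all = ''.join(content_list)
    print(content_all)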
And, finally, you can combine both a thread pool and a multiprocessing pool, using the latter to take care of the CPU-intensive portion of the processing:
import requests, os
from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool
from multiprocessing.pool import Pool
from functools import partial

session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
links = ['https://www.investopedia.com/terms/1/0x-protocol.asp',
         'https://www.investopedia.com/terms/1/1-10net30.asp',
         'https://www.investopedia.com/terms/1/10-k.asp',
         'https://www.investopedia.com/terms/1/10k-wrap.asp',
         'https://www.investopedia.com/terms/1/10q.asp']

############ Get content of a word

def get_content(process_pool, l):
    r = session.get(l, headers = headers)
    return process_pool.apply(process_content, args=(r.content,))

def process_content(content):
    soup = BeautifulSoup(content, 'html.parser')
    entry_name = soup.select_one('#article-heading_3-0').contents[0]
    main = soup.select('.comp.article-body.mntl-block')[0]
    content = entry_name + '\n' + '<link href="investopedia.css" rel="stylesheet"/>' + '\n' + str(main) + '\n</>\n'
    return(content)

############ Parallel computing

if __name__=="__main__":
    process_pool = Pool(processes = min(len(links), os.cpu_count()))
    thread_pool = ThreadPool(processes = min(len(links), 25))
    content_list = thread_pool.map(partial(get_content, process_pool), links)
    content_all = ''.join(content_list)
    print(content_all)
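Note how this last version divides the work: each thread downloads a page and then calls process_pool.apply, which blocks that thread until a worker process has finished parsing the HTML it handed over. The network waits still overlap across the threads, while the CPU-heavy BeautifulSoup work runs in separate processes and so is not serialized by the GIL.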