
Packages that are imported are not recognized during parallel computing?

I’m running the function get_content in a parallel setting with multiprocess.Pool, and it throws the error NameError: name 'session' is not defined. Clearly, I defined it with session = requests.Session(). Could you please elaborate on this issue?

import requests, os
from bs4 import BeautifulSoup
from multiprocess import Pool, freeze_support
core = os.cpu_count()
session = requests.Session() 
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}

links = ['https://www.investopedia.com/terms/1/0x-protocol.asp',
         'https://www.investopedia.com/terms/1/1-10net30.asp',
         'https://www.investopedia.com/terms/1/10-k.asp',
         'https://www.investopedia.com/terms/1/10k-wrap.asp',
         'https://www.investopedia.com/terms/1/10q.asp']

############ Get content of a word
def get_content(l):
    r = session.get(l, headers = headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    entry_name = soup.select_one('#article-heading_3-0').contents[0]
    main = soup.select('.comp.article-body.mntl-block')[0]
    content = entry_name + '\n' + '<link href="investopedia.css" rel="stylesheet"/>' + '\n' + str(main) + '\n</>\n'
    return(content)

############ Parallel computing
if __name__=="__main__":
    P = Pool(processes = core)   
    content_list = P.map(get_content, links)
    content_all = ''.join(content_list)    
    freeze_support()


Answer

First of all, your import statement is incorrect and should be:

from multiprocessing import Pool, freeze_support

(You had from multiprocess ..., so I am not sure how it ran at all)

With the correct import statement, the code runs for me, but it does not do what you think it does! I surmise from the call to freeze_support that you are running under Windows. On that platform new processes are started with the spawn method, which re-executes the entire program from the very top. This is why the code that creates the new processes must live inside a block guarded by if __name__ == '__main__':. If it didn't, each newly created process would re-execute the code that just created it, spawning new processes in a never-ending recursive loop.
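
To make the re-execution concrete, here is a minimal sketch (the file name demo_spawn.py and the square function are hypothetical, not part of your code). Under spawn, the module-level print runs once in the parent and once in every worker, while the guarded block runs only in the parent:

# demo_spawn.py -- hypothetical, minimal illustration of spawn re-importing the module
from multiprocessing import Pool, set_start_method
import os

print(f'module-level code running in process {os.getpid()}')  # executes in the parent and in each worker

def square(x):
    return x * x

if __name__ == '__main__':
    set_start_method('spawn')        # the default start method on Windows
    with Pool(processes=2) as pool:  # guarded: only the parent creates the pool
        print(pool.map(square, [1, 2, 3]))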

This means that each process is re-creating its own Session instance due to the following statement being at global scope:

session = requests.Session()

So you are getting no real benefit from being able to re-use the same Session instance for the multiple URLs you are attempting to retrieve. In order to reuse one Session instance, you must initialize the multiprocessing pool itself with the session object, so that each worker process receives it once at start-up and can then reuse it for every URL that worker handles. You should also keep only the minimal executable code at global scope:

import requests, os
from bs4 import BeautifulSoup
from multiprocessing import Pool, freeze_support

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}

def init_pool(s):
    # runs once in each worker process and makes the passed-in Session
    # available to get_content as a global
    global session
    session = s


############ Get content of a word
def get_content(l):
    r = session.get(l, headers = headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    entry_name = soup.select_one('#article-heading_3-0').contents[0]
    main = soup.select('.comp.article-body.mntl-block')[0]
    content = entry_name + '\n' + '<link href="investopedia.css" rel="stylesheet"/>' + '\n' + str(main) + '\n</>\n'
    return(content)

############ Parallel computing
if __name__=="__main__":
    freeze_support()   # only needed if this script is later frozen into an executable
    core = os.cpu_count()
    session = requests.Session()

    links = ['https://www.investopedia.com/terms/1/0x-protocol.asp',
             'https://www.investopedia.com/terms/1/1-10net30.asp',
             'https://www.investopedia.com/terms/1/10-k.asp',
             'https://www.investopedia.com/terms/1/10k-wrap.asp',
             'https://www.investopedia.com/terms/1/10q.asp']

    p = Pool(processes = core, initializer=init_pool, initargs=(session,))
    content_list = p.map(get_content, links)
    content_all = ''.join(content_list)
    print(content_all)

But in fact your code spends most of its time waiting for the URLs to be retrieved and only a little CPU time processing the returned HTML, which makes it a good candidate for multithreading instead of multiprocessing. The only changes your original code needs to use multithreading are (1) getting rid of all references to freeze_support (which you did not need for multiprocessing anyway unless you were planning on creating an exe file) and (2) changing one import statement:

from multiprocessing.dummy import Pool

Additionally, the number of threads need not be limited by the number of CPU cores you have (although there is some maximum you will not want to exceed):

import requests, os
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool
session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}

links = ['https://www.investopedia.com/terms/1/0x-protocol.asp',
         'https://www.investopedia.com/terms/1/1-10net30.asp',
         'https://www.investopedia.com/terms/1/10-k.asp',
         'https://www.investopedia.com/terms/1/10k-wrap.asp',
         'https://www.investopedia.com/terms/1/10q.asp']

############ Get content of a word
def get_content(l):
    r = session.get(l, headers = headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    entry_name = soup.select_one('#article-heading_3-0').contents[0]
    main = soup.select('.comp.article-body.mntl-block')[0]
    content = entry_name + '\n' + '<link href="investopedia.css" rel="stylesheet"/>' + '\n' + str(main) + '\n</>\n'
    return(content)

############ Concurrent computing
if __name__=="__main__":
    # max of 25 is arbitrary; we do not want to appear to be a denial of service attack
    P = Pool(processes = min(len(links), 25))
    content_list = P.map(get_content, links)
    content_all = ''.join(content_list)
    print(content_all)
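
If you prefer the standard library's newer interface, concurrent.futures.ThreadPoolExecutor behaves equivalently to multiprocessing.dummy.Pool here; the following sketch assumes the same get_content and links as above:

from concurrent.futures import ThreadPoolExecutor

if __name__=="__main__":
    # same arbitrary cap of 25 threads as above
    with ThreadPoolExecutor(max_workers=min(len(links), 25)) as executor:
        content_list = list(executor.map(get_content, links))
    content_all = ''.join(content_list)
    print(content_all)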

And, finally, you can combine a thread pool and a multiprocessing pool, using the latter to take care of the CPU-intensive portion of the processing:

import requests, os
from bs4 import BeautifulSoup
from multiprocessing.pool import Pool, ThreadPool
from functools import partial


session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}

links = ['https://www.investopedia.com/terms/1/0x-protocol.asp',
         'https://www.investopedia.com/terms/1/1-10net30.asp',
         'https://www.investopedia.com/terms/1/10-k.asp',
         'https://www.investopedia.com/terms/1/10k-wrap.asp',
         'https://www.investopedia.com/terms/1/10q.asp']

############ Get content of a word
def get_content(process_pool, l):
    r = session.get(l, headers = headers)
    return process_pool.apply(process_content, args=(r.content,))

def process_content(content):
    soup = BeautifulSoup(content, 'html.parser')
    entry_name = soup.select_one('#article-heading_3-0').contents[0]
    main = soup.select('.comp.article-body.mntl-block')[0]
    content = entry_name + '\n' + '<link href="investopedia.css" rel="stylesheet"/>' + '\n' + str(main) + '\n</>\n'
    return(content)


############ Parallel computing
if __name__=="__main__":
    process_pool = Pool(processes = min(len(links), os.cpu_count()))
    thread_pool = ThreadPool(processes = min(len(links), 25))
    content_list = thread_pool.map(partial(get_content, process_pool), links)
    content_all = ''.join(content_list)
    print(content_all)
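
One small refinement, not in the code above: both Pool and ThreadPool support the context-manager protocol, so the combined version can clean up its workers automatically. A sketch assuming the same get_content, process_content and links as above:

if __name__=="__main__":
    # the context managers terminate the pools' worker processes/threads on exit
    with Pool(processes = min(len(links), os.cpu_count())) as process_pool, \
         ThreadPool(processes = min(len(links), 25)) as thread_pool:
        content_list = thread_pool.map(partial(get_content, process_pool), links)
    content_all = ''.join(content_list)
    print(content_all)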