I am trying to collect the size of the homepages of a list of sites using multiprocessing. Following is the code:
```python
import time
import urllib.request
from multiprocessing import Pool, TimeoutError

start = time.time()

def sitesize(url):
    for url in sites:
        with urllib.request.urlopen(url) as u:
            page = u.read()
            print(url, len(page))

sites = [
    'https://www.yahoo.com',
    'http://www.cnn.com',
    'http://www.python.org',
    'http://www.jython.org',
    'http://www.pypy.org',
    'http://www.perl.org',
    'http://www.cisco.com',
    'http://www.facebook.com',
    'http://www.twitter.com',
    'http://arstechnica.com',
    'http://www.reuters.com',
    'http://www.abcnews.com',
    'http://www.cnbc.com',
]

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        for result in pool.imap_unordered(sitesize, sites):
            print(result)
    print(f'Time taken : {time.time() - start}')
```
I have a Windows 10 laptop running Python 3.9. I am not using a venv.
This code goes into a loop: it executes 4 times over and takes 4 times longer than it should. What is the error here? Can someone help?
Thanks in advance
Sachin
Answer
I think you misunderstood how `pool.imap_unordered` works: the provided function is called with one value from `sites` at a time, whereas in your case the function completely discards the provided `url` and instead loops over every value in the `sites` list.
You should simply do:

```python
def sitesize(url):
    # Fetch only the single URL this call was given.
    with urllib.request.urlopen(url) as u:
        page = u.read()
    print(url, len(page))
    return url, len(page)  # so print(result) in the main loop shows something useful
```
See the doc.