Here is a similar question: Why does python multiprocessing script slow down after a while?
Sample of code that uses Pool:
from multiprocessing import Pool

Pool(processes=6).map(some_func, array)
After a few iterations the program slows down, and eventually it becomes even slower than it was without multiprocessing. Maybe the problem is that the function uses Selenium? Here is the full code:
# libraries
import os
from time import sleep
from bs4 import BeautifulSoup
from selenium import webdriver
from multiprocessing import Pool

# Required variables
url = "https://eldorado.ua/"
directory = os.path.dirname(os.path.realpath(__file__))
env_path = directory + "chromedriver"
chromedriver_path = env_path + "chromedriver.exe"

dict1 = {"Смартфоны и телефоны": "https://eldorado.ua/node/c1038944/",
         "Телевизоры и аудиотехника": "https://eldorado.ua/node/c1038957/",
         "Ноутбуки, ПК и Планшеты": "https://eldorado.ua/node/c1038958/",
         "Техника для кухни": "https://eldorado.ua/node/c1088594/",
         "Техника для дома": "https://eldorado.ua/node/c1088603/",
         "Игровая зона": "https://eldorado.ua/node/c1285101/",
         "Гаджеты и аксесуары": "https://eldorado.ua/node/c1215257/",
         "Посуда": "https://eldorado.ua/node/c1039055/",
         "Фото и видео": "https://eldorado.ua/node/c1038960/",
         "Красота и здоровье": "https://eldorado.ua/node/c1178596/",
         "Авто и инструменты": "https://eldorado.ua/node/c1284654/",
         "Спорт и туризм": "https://eldorado.ua/node/c1218544/",
         "Товары для дома и сада": "https://eldorado.ua/node/c1285161/",
         "Товары для детей": "https://eldorado.ua/node/c1085100/"}


def openChrome_headless(url1, name):
    options = webdriver.ChromeOptions()
    options.headless = True
    options.add_experimental_option("excludeSwitches", ['enable-automation'])
    options.add_argument(
        '--user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36"')
    driver = webdriver.Chrome(executable_path=chromedriver_path, options=options)
    driver.get(url=url1)
    sleep(1)
    try:
        with open(name + ".html", "w", encoding="utf-8") as file:
            file.write(driver.page_source)
    except Exception as ex:
        print(ex)
    finally:
        driver.close()
        driver.quit()


def processing_goods_pages(name):
    for n in os.listdir(f"brand_pages\{name}"):
        with open(f"{directory}\brand_pages\{name}\{n}", encoding="utf-8") as file:
            soup = BeautifulSoup(file.read(), "lxml")

        if not os.path.exists(f"{directory}\goods_pages\{name}\{n[:-5]}"):
            if not os.path.exists(f"{directory}\goods_pages\{name}"):
                os.mkdir(f"{directory}\goods_pages\{name}")
            os.mkdir(f"{directory}\goods_pages\{name}\{n[:-5]}")

        links = soup.find_all("header", class_="good-description")
        for li in links:
            ref = url + li.find('a').get('href')
            print(li.text)
            openChrome_headless(ref, f"{directory}\goods_pages\{name}\{n[:-5]}\{li.text}")


if __name__ == "__main__":
    ar2 = []
    for k, v in dict1.items():
        ar2.append(k)
    Pool(processes=6).map(processing_goods_pages, ar2)
Answer
You are creating 6 processes to process 14 URLs, which is fine so far. But then each process in the pool, in order to handle one of those URLs, launches a headless Chrome browser once for every link it reads from a file for that URL. I don't know how many links it processes per URL on average, and I can't say for certain that opening and closing Chrome so many times is the cause of the eventual slowdown. But it seems to me that if you want a multiprocessing level of 6, you should never need more than 6 Chrome sessions running. Accomplishing this, however, takes a bit of code refactoring.
The first thing I would note is that this job could probably just as well use multithreading instead of multiprocessing. There is some CPU-intensive work done by BeautifulSoup and the lxml parser, but I suspect it pales in comparison to launching Chrome 6 times and fetching the URL pages, especially since you have a hard-coded wait of 1 second following each URL fetch (more on this later).
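To make that concrete, the switch at the top level is essentially a drop-in replacement of the pool class. This is only a minimal sketch of that one change (the full refactored code is further down):

# multiprocessing.pool.ThreadPool exposes the same map() interface as
# multiprocessing.Pool, so only the import and the pool class need to change.
from multiprocessing.pool import ThreadPool

with ThreadPool(processes=6) as pool:
    pool.map(processing_goods_pages, dict1.keys())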
The idea is to store in thread-local storage the currently open Chrome driver for each thread in the multithreading pool and to never quit the driver until the end of the program. The logic that was in function openChrome_headless now needs to be moved to a new special function create_driver that can be called by processing_goods_pages to get the current Chrome driver for the current thread (or create one if there isn't one currently). But that means the URL-specific code that had been in openChrome_headless now needs to be moved to processing_goods_pages.
Finally, thread-local storage is deleted and the garbage collector is run, so that the destructors for all the instances of class Driver run and all the Chrome driver instances are "quitted."
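To show just that cleanup mechanism in isolation, here is a toy sketch that has nothing to do with Selenium: once the last reference to the thread-local container disappears, the objects stored in it are destroyed and their __del__ methods run.

import gc
import threading

class Resource:
    def __del__(self):
        print("cleaned up")   # stands in for driver.quit()

tl = threading.local()
tl.res = Resource()

del tl        # drop the container holding the Resource
gc.collect()  # extra insurance in case reference cycles delayed collection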
Since I do not have access to your files, this obviously could not be thoroughly tested, so there could be a spelling error or 10. Good luck.
One further note: Instead of calling sleep(1) after the driver.get(ref) call, you should look into calling driver.implicitly_wait(1) instead, followed by a driver call that locates an element whose presence ensures that everything you need on the page for writing it out has been loaded, if such a thing is possible. That way you only wait the minimum time necessary for the links to be present. Of course, if the DOM is not modified after the initial page load via AJAX calls, there is no need to sleep at all.
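As a rough sketch of that idea (the helper name save_page and the CSS selector "header.good-description" are just guesses on my part; pick whatever element reliably signals that the page has finished loading), the body of the download loop could look something like this:

from selenium.webdriver.common.by import By

def save_page(driver, ref, out_name):
    driver.implicitly_wait(1)  # wait up to 1 second whenever an element is looked up
    driver.get(ref)
    try:
        # Returns as soon as the element is present instead of always sleeping a full second.
        driver.find_element(By.CSS_SELECTOR, "header.good-description")
    except Exception as ex:
        print(ex)  # element never appeared within the timeout; save what we have anyway
    with open(out_name + ".html", "w", encoding="utf-8") as file:
        file.write(driver.page_source)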
import os
from time import sleep
from bs4 import BeautifulSoup
from selenium import webdriver
# Use multithreading instead of multiprocessing
from multiprocessing.pool import ThreadPool
import threading

# Required variables
url = "https://eldorado.ua/"
directory = os.path.dirname(os.path.realpath(__file__))
env_path = directory + "chromedriver"
chromedriver_path = env_path + "chromedriver.exe"


class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.headless = True
        options.add_experimental_option("excludeSwitches", ['enable-automation'])
        options.add_argument(
            '--user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36"')
        self.driver = webdriver.Chrome(executable_path=chromedriver_path, options=options)

    def __del__(self):
        self.driver.quit()  # clean up driver when we are cleaned up
        # print('The driver has been "quitted".')


threadLocal = threading.local()


def create_driver():
    # Reuse this thread's Chrome instance, creating it on first use.
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver


dict1 = {"Смартфоны и телефоны": "https://eldorado.ua/node/c1038944/",
         "Телевизоры и аудиотехника": "https://eldorado.ua/node/c1038957/",
         "Ноутбуки, ПК и Планшеты": "https://eldorado.ua/node/c1038958/",
         "Техника для кухни": "https://eldorado.ua/node/c1088594/",
         "Техника для дома": "https://eldorado.ua/node/c1088603/",
         "Игровая зона": "https://eldorado.ua/node/c1285101/",
         "Гаджеты и аксесуары": "https://eldorado.ua/node/c1215257/",
         "Посуда": "https://eldorado.ua/node/c1039055/",
         "Фото и видео": "https://eldorado.ua/node/c1038960/",
         "Красота и здоровье": "https://eldorado.ua/node/c1178596/",
         "Авто и инструменты": "https://eldorado.ua/node/c1284654/",
         "Спорт и туризм": "https://eldorado.ua/node/c1218544/",
         "Товары для дома и сада": "https://eldorado.ua/node/c1285161/",
         "Товары для детей": "https://eldorado.ua/node/c1085100/"}


def processing_goods_pages(name):
    for n in os.listdir(f"brand_pages\{name}"):
        with open(f"{directory}\brand_pages\{name}\{n}", encoding="utf-8") as file:
            soup = BeautifulSoup(file.read(), "lxml")

        if not os.path.exists(f"{directory}\goods_pages\{name}\{n[:-5]}"):
            if not os.path.exists(f"{directory}\goods_pages\{name}"):
                os.mkdir(f"{directory}\goods_pages\{name}")
            os.mkdir(f"{directory}\goods_pages\{name}\{n[:-5]}")

        links = soup.find_all("header", class_="good-description")
        driver = create_driver()
        for li in links:
            ref = url + li.find('a').get('href')
            print(li.text)
            driver.get(ref)
            sleep(1)
            # Use a separate variable so the category name in `name` is not overwritten:
            file_name = f"{directory}\goods_pages\{name}\{n[:-5]}\{li.text}"
            try:
                with open(file_name + ".html", "w", encoding="utf-8") as file:
                    file.write(driver.page_source)
            except Exception as ex:
                print(ex)


if __name__ == "__main__":
    ThreadPool(processes=6).map(processing_goods_pages, dict1.keys())
    # Quit all the Selenium drivers:
    del threadLocal
    import gc
    gc.collect()  # a little extra insurance