I have 60 proxies (residential, with username and password) and I want to scrape 10000 webpages. I want to rotate through the IPs so that each thread uses one IP per second: every second, 60 threads are running, each scraping one page.
But I just can’t do it.
The best I was able to do is the program below. It uses one IP per thread, but only for the first 60 pages. I want it to continue until all 10000 pages are scraped.
How can I do that? Would asyncio be a better choice?
import threading
import requests
import time
import lxml.html
import csv
from concurrent.futures import ThreadPoolExecutor

def scrape_page(html, url):
    # SCRAPE STUFF FROM URL
    return LIST

def download(url, proxy):
    try:
        proxy = {"https": proxy, "http": proxy}
        r = requests.get(url, proxies=proxy, stream=True)
        r.raw.decode_content = True
        time.sleep(1)  # throttle: each worker makes at most one request per second
    except Exception as err:
        print(url, "503")
        return None  # skip pages that failed to download
    return scrape_page(r.text, url)

websites = LIST WITH 10000 SITES
ROTATING_PROXY_LIST = LIST WITH 60 PROXIES

with ThreadPoolExecutor(max_workers=60) as executor:
    data = []
    for result in executor.map(download, websites, ROTATING_PROXY_LIST):
        data.append(result)

with open("results.csv", "w", newline="", encoding="utf8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerows(data)
Answer
The problem is that when you write this:
executor.map(download, websites, ROTATING_PROXY_LIST)
you’re effectively asking for zip(websites, ROTATING_PROXY_LIST), which will only ever be as long as the shortest iterable.
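For example, here is a quick sketch of that truncation, using made-up placeholder lists rather than your real URLs and proxies:

# zip stops as soon as its shortest argument runs out, so only
# the first 60 (url, proxy) pairs are ever produced.
urls = [f"https://example.com/page/{i}" for i in range(10000)]
proxies = [f"proxy{i}" for i in range(60)]

pairs = list(zip(urls, proxies))
print(len(pairs))  # 60, not 10000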
You can solve this by making ROTATING_PROXY_LIST effectively infinite:
import itertools

...

with ThreadPoolExecutor(max_workers=60) as executor:
    data = []
    for result in executor.map(download, websites, itertools.cycle(ROTATING_PROXY_LIST)):
        data.append(result)
itertools.cycle will “Return elements from the iterable until it is exhausted. Then repeat the sequence indefinitely.”
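Since executor.map still stops at the shortest iterable, and that is now websites, every one of the 10000 pages gets paired with a proxy. A quick sanity check, again with the same placeholder lists:

import itertools

# cycle() makes the proxy list inexhaustible, so the pairing runs until
# the URL list ends; the 61st URL wraps around to the first proxy.
urls = [f"https://example.com/page/{i}" for i in range(10000)]
proxies = [f"proxy{i}" for i in range(60)]

pairs = list(zip(urls, itertools.cycle(proxies)))
print(len(pairs))  # 10000
print(pairs[60])   # ('https://example.com/page/60', 'proxy0')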