
How to scrape websites with threads – 1 IP per thread?

I have 60 residential proxies (each with a username and password) and I want to scrape 10000 webpages. I want to rotate over the IPs so that each thread uses 1 IP and makes 1 request per second. So every second there are 60 threads, each scraping 1 page.

But I just can’t do it.

The best I was able to do is the below program. It does 1 IP per thread, but only for 60 pages. I want it to continue until all 10000 pages are scraped.

How can I do that? Would asyncio be a better choice?



Answer

The problem is that the way your code pairs websites with ROTATING_PROXY_LIST is effectively zip(websites, ROTATING_PROXY_LIST), which will only ever be as long as the shortest iterable. With 60 proxies that gives 60 pairs, so only 60 pages get scraped. You can solve this by making ROTATING_PROXY_LIST effectively infinite by wrapping it in itertools.cycle.

itertools.cycle will “Return elements from the iterable until it is exhausted. Then repeat the sequence indefinitely.”
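
As a rough illustration of the point above, here is a minimal sketch. It assumes a thread pool from concurrent.futures; the proxy strings, the URLs and the scrape function are hypothetical placeholders rather than the code from the question, and no real requests are made.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the question's data: 60 proxies, 10000 pages.
ROTATING_PROXY_LIST = [f"proxy-{i}" for i in range(60)]
websites = [f"https://example.com/page/{i}" for i in range(10000)]


def scrape(url, proxy):
    """Placeholder for the real request; just returns the pairing it was given."""
    return url, proxy


# Plain zip stops at the shorter input, so only 60 pages get a proxy.
truncated = list(zip(websites, ROTATING_PROXY_LIST))

# Cycling the proxies makes that side endless, so zip now stops only
# when the websites run out.
paired = list(zip(websites, itertools.cycle(ROTATING_PROXY_LIST)))

# executor.map pairs its iterables the same way zip does, so the same fix
# applies: 60 workers, each call receiving the next proxy in the rotation.
with ThreadPoolExecutor(max_workers=len(ROTATING_PROXY_LIST)) as executor:
    results = list(
        executor.map(scrape, websites, itertools.cycle(ROTATING_PROXY_LIST))
    )
```

Note that this only rotates proxies across pages (page i gets proxy i modulo 60). It does not pin a proxy to a particular thread or enforce the 1 request per second pace; that would need extra logic, such as a sleep inside scrape.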

User contributions licensed under: CC BY-SA