
Optimising a Python scraping script to avoid getting blocked or draining resources

I have a fairly basic Python script that scrapes a property website and stores the address and price of each listing in a CSV file. There are over 5,000 listings to go through, but my current code times out after a while (about 2,000 listings) and the console shows 302 redirects and CORS policy errors.

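The core of the loop looks roughly like this (a simplified sketch rather than the full script; the site URL, page count and CSS selectors below are placeholders):

# Simplified sketch; the URL, page count and selectors are placeholders, not the real site.
import csv
from random import randint
from time import sleep

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.example-property-site.com/for-sale?page={}"  # placeholder

with open("listings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Address", "Price"])

    for page in range(1, 450):  # placeholder page count
        response = requests.get(BASE_URL.format(page))
        soup = BeautifulSoup(response.text, "html.parser")

        # each listing card holds an address and a price
        for card in soup.select("div.listing"):
            address = card.select_one("p.address").get_text(strip=True)
            price = card.select_one("span.price").get_text(strip=True)
            writer.writerow([address, price])

        sleep(randint(1, 5))  # random pause between page requests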

As you can see, I added sleep(randint(1, 5)) to put random intervals between requests, but I probably need to do more. Of course I want to scrape the site in its entirety as quickly as possible, but I also want to be respectful of the site being scraped and minimise the burden on it.

Can anyone suggest updates? PS: forgive any rookie errors, I'm very new to Python and scraping!


Answer

This is one way of getting that data. Bear in mind there are only 251 pages, with 12 properties on each of them (around 3,000 listings in total, not over 5k):

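A sketch of that approach, assuming the listing pages are reachable through a page query parameter and that each property card exposes its address and price via CSS classes (the URL, headers and selectors below are placeholders):

# Sketch of the paginated approach; URL, headers and selectors are placeholders.
import requests
import pandas as pd
from bs4 import BeautifulSoup
from tqdm import tqdm

headers = {"User-Agent": "Mozilla/5.0"}  # send a browser-like User-Agent
big_list = []  # one (address, price) pair per property

with requests.Session() as session:
    session.headers.update(headers)
    for page in tqdm(range(1, 252)):  # 251 listing pages, 12 properties each
        url = f"https://www.example-property-site.com/for-sale?page={page}"  # placeholder
        response = session.get(url, timeout=30)
        soup = BeautifulSoup(response.text, "html.parser")
        for card in soup.select("div.property-card"):  # placeholder selector
            address = card.select_one("p.address").get_text(strip=True)
            price = card.select_one("span.price").get_text(strip=True)
            big_list.append((address, price))

df = pd.DataFrame(big_list, columns=["Address", "Price"])
df.to_csv("properties.csv", index=False)
print(df)

Reusing one Session keeps the connection alive and sends the same headers on every request; a realistic User-Agent is often enough to stop a site from redirecting (302) a bare scraper, and tqdm simply adds a progress bar over the 251 pages.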

The resulting table of addresses and prices is printed in the terminal.


See the relevant documentation for Requests: https://requests.readthedocs.io/en/latest/

For Pandas: https://pandas.pydata.org/docs/

For BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/

And for tqdm: https://pypi.org/project/tqdm/
