Python & BS4 – Strange behaviour, scraper freezes/stops working after a while without an error

I’m trying to scrape eastbay.com for Jordans. I have set up my scraper using BS4 and it works, but never finishes or reports an error, just freezes at some point.

The strange thing is that it stops at some point and pressing CTRL+C in the Python console (where it’s outputting the prints as it’s running) does nothing, but it is supposed to stop the operation and report that it was stopped by the user. Also, after it stops, it saves the data it managed to scrape by that point in a .csv file. Curiously, if I run the program again, it will get some more data, and then freeze again. Every time I run it, it gets a bit more data, albeit with diminishing returns. I’ve never experienced anything like it.

I have set up my whole program which I will paste here, so if anyone has an idea why it would stop, please let me know.



Answer

There are a few things you should consider here:

  1. The site has rate limiting, which means you can only scrape the API for a limited time before you get blocked. Try capturing the response status code: if you get 429 Too Many Requests, you're being rate limited.
  2. The site may have a WAF/IDS/IPS in place to prevent abuse of its API.
  3. Because of too many requests within a short time, the site becomes less responsive and your requests start timing out.
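The first point can be checked with a small helper. A minimal sketch, assuming the scraper uses the requests library (classify_response and fetch are illustrative names, not part of the asker's code):

```python
import requests

def classify_response(status_code):
    """Map an HTTP status code to a coarse diagnosis for the scraper."""
    if status_code == 429:
        return "rate-limited"        # the site is throttling you: slow down
    if status_code in (403, 503):
        return "possibly blocked"    # typical WAF / bot-protection responses
    if 200 <= status_code < 300:
        return "ok"
    return "unexpected"

def fetch(url, timeout=10):
    """GET a page with a hard timeout and report the diagnosis with the body."""
    resp = requests.get(url, timeout=timeout)
    return classify_response(resp.status_code), resp.text
```

Printing classify_response(...) for every request will quickly show whether the freezes coincide with 429s or block pages rather than genuine hangs.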

There are several ways to resolve this:

  1. Set a default timeout of 7-8 seconds and skip the requests that exceed it.
  2. Increase the timeout value to 15 seconds.
  3. Delay your requests: put a time.sleep(2) between consecutive requests.
  4. Add detailed logging of status codes, exceptions, everything. This will help you pinpoint where your script goes wrong.
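The four points above can be combined into one wrapper. A sketch assuming requests is in use; the function name polite_get, the retry counts, and the back-off delays are illustrative choices, not requirements:

```python
import logging
import time

import requests

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def polite_get(url, session, timeout=15, delay=2.0, max_retries=3):
    """GET with a hard timeout, a pause between calls, and logged failures.

    Returns the response, or None if every attempt timed out or was throttled.
    """
    for attempt in range(1, max_retries + 1):
        time.sleep(delay)                      # point 3: space out requests
        try:
            resp = session.get(url, timeout=timeout)  # points 1-2: bounded wait
        except requests.exceptions.Timeout:
            logging.warning("timeout on %s (attempt %d)", url, attempt)
            continue                           # point 1: skip it, don't hang
        logging.info("%s -> %d", url, resp.status_code)  # point 4: log status
        if resp.status_code == 429:
            logging.warning("rate limited on %s, backing off", url)
            time.sleep(10 * attempt)           # back off harder each retry
            continue
        return resp
    return None                                # give up instead of freezing
```

Because every request either returns, times out, or is logged and retried a bounded number of times, the script can no longer block indefinitely on a single unresponsive connection, which matches the freeze-without-error symptom described in the question.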
User contributions licensed under: CC BY-SA