I’m trying to scrape eastbay.com for Jordans. I have set up my scraper using BS4, and it works, but it never finishes or reports an error; it just freezes at some point.
The strange thing is that once it stops, pressing CTRL+C in the Python console (where it prints output as it runs) does nothing, even though that should abort the operation and report that it was stopped by the user. Also, after it stops, it saves the data it managed to scrape up to that point in a .csv file. Curiously, if I run the program again, it gets some more data before freezing again. Every time I run it, it gets a bit more data, albeit with diminishing returns. I’ve never experienced anything like it.
I’ll paste my whole program below, so if anyone has an idea why it stops, please let me know.
```python
import requests
import csv
import io
import json
import os
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from datetime import datetime
from bs4 import BeautifulSoup

url = 'https://www.eastbay.com/api/products/search'
session = requests.Session()
session.max_redirects = 30
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
payload = {
    'query': ':relevance:gender:200000:productType:200005:brand:Jordan',
    'currentPage': '0',
    'pageSize': '200',
    'timestamp': '4'}

# First request: find out how many pages and results there are.
jsonData = session.get(url, headers=headers, params=payload).json()
totalPages = jsonData['pagination']['totalPages']
totalResults = jsonData['pagination']['totalResults']
print('%s total results to acquire' % totalResults)

# Collect the product page URLs from every page of search results.
container = []
for page in range(0, totalPages + 1):
    payload = {
        'query': ':relevance:gender:200000:productType:200005:brand:Jordan',
        'currentPage': page,
        'pageSize': '200',
        'timestamp': '4'}
    jsonData = session.get(url, headers=headers, params=payload).json()
    try:
        for product in jsonData['products']:
            name = product['name']
            removal_list4 = [" ", "/", "'"]
            for word4 in removal_list4:
                name = name.replace(word4, "")
            url2 = product['url']
            url3 = "https://www.eastbay.com/product/" + name + "/" + url2 + ".html"
            container.append(url3)
    except:
        print('Products not found on this request')

print(container)

timeanddate = datetime.now().strftime("%Y%m%d-%H%M%S")
folder_path = 'my_path'
file_name = 'eastbay_jordans_' + timeanddate + '.csv'
full_name = os.path.join(folder_path, file_name)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}

# Visit every product page and write one CSV row per product.
with io.open(full_name, 'w', newline='', encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Brand", "Model", "SKU", "Color", "Size", "Price", "Link"])
    for url3 in container:
        data2 = session.get(url3, headers=headers)
        soup2 = BeautifulSoup(data2.text, 'lxml')

        if not soup2.find('script', attrs={'type': 'application/ld+json'}):
            brand = "Unavailable"
            getbrand = "Unavailable"
        else:
            brand = soup2.find('script', attrs={'type': 'application/ld+json'})
            getbrand = json.loads(brand.text)['brand']

        if not soup2.find('span', attrs={'class': 'ProductName-primary'}):
            model = "Unavailable"
        else:
            model = soup2.find('span', attrs={'class': 'ProductName-primary'}).text.strip()
            removal_list2 = [" - ", "NIKE", "Nike", "Jordan", "JORDAN", "REEBOK", "CHAMPION",
                             "TIMBERLANDS", "FILA", "LACOSTE", "CONVERSE", "Adidas", "ADIDAS",
                             "New Balance", "NEW BALANCE", "Vans", "Puma", "UGG", "Saucony",
                             "Reebok", "Women's ", "adidas", "Dr. Martens", "Converse", "Fila",
                             "PUMA", "Champion", "Diadora", "Timberland", "SNKR PROJECT",
                             "Women's ", "Men's ", "Unisex ", "Under Armour", "UNDER ARMOUR"]
            for word2 in removal_list2:
                model = model.replace(word2, "")

        if not soup2.find('div', attrs={'class': 'Tab-panel'}):
            sku = "Unavailable"
            getsku = "Unavailable"
        else:
            sku = soup2.find('div', attrs={'class': 'Tab-panel'})
            for child in sku.findAll("div"):
                child.decompose()
            getsku = sku.get_text()
            removal_list3 = ["Product #: "]
            for word3 in removal_list3:
                getsku = getsku.replace(word3, "")

        if not soup2.find('p', attrs={'class': 'ProductDetails-form__label'}):
            color = "Unavailable"
        else:
            color = soup2.find('p', attrs={'class': 'ProductDetails-form__label'}).text.strip()

        if not soup2.find('div', attrs={'class': 'ProductSize-group'}):
            size = "Unavailable"
            getsize = "Unavailable"
        else:
            size = soup2.find('div', attrs={'class': 'ProductSize-group'})
            getsize = [item.text.strip() for item in size.select(
                'div.c-form-field.c-form-field--radio.ProductSize:not(div.c-form-field.c-form-field--radio.c-form-field--disabled.ProductSize)')]

        if not soup2.find('div', attrs={'class': 'ProductPrice'}):
            price = "Unavailable"
        elif not soup2.find('span', attrs={'class': 'ProductPrice-final'}):
            price = soup2.find('div', attrs={'class': 'ProductPrice'}).text.strip()
        else:
            price = soup2.find('span', attrs={'class': 'ProductPrice-final'}).text.strip()

        productlink = url3
        # Print for test purposes
        print(getbrand, model, getsku, color, getsize, price, productlink)
        writer.writerow([getbrand, model, getsku, color, getsize, price, productlink])

file.close()  # redundant: the with-block above already closed the file
```
Answer
There are a few things you should consider here:

- The site rate-limits requests, which means you can only hit the API for a limited time before you get blocked. Try capturing the response status code: if you get `429 Too Many Requests`, you’re being rate limited (see the sketch after this list).
- The site has a WAF/IDS/IPS in place to prevent abuse of its API.
- Because of too many requests in a short time, the site becomes less responsive, and your requests end up timing out.
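For example, you can inspect the status code before parsing the JSON. This is a minimal sketch against the search endpoint from your script; the `Retry-After` handling and the 30-second fallback are my assumptions, not something the site is guaranteed to send:

```python
import time
import requests

url = 'https://www.eastbay.com/api/products/search'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
payload = {'query': ':relevance:gender:200000:productType:200005:brand:Jordan',
           'currentPage': '0', 'pageSize': '200', 'timestamp': '4'}

session = requests.Session()
response = session.get(url, headers=headers, params=payload, timeout=10)

if response.status_code == 429:
    # Rate limited: some servers send a Retry-After header (in seconds).
    wait = int(response.headers.get('Retry-After', 30))
    print('Rate limited, waiting %s seconds' % wait)
    time.sleep(wait)
elif response.status_code != 200:
    print('Unexpected status code: %s' % response.status_code)
else:
    jsonData = response.json()
```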
There are a few ways to resolve this:

- Give your requests a default timeout of 7-8 seconds and skip the ones that exceed it.
- Alternatively, increase the timeout value to 15 seconds.
- Delay your requests: put a `time.sleep(2)` between consecutive requests.
- Keep a detailed log of status codes, exceptions, everything. This will help you understand where your script went wrong. A sketch combining these ideas follows the list.
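Putting that together, here is a hedged sketch of your product-page loop with a timeout, a delay, and logging. The log file name, the 8-second timeout, and the skip-on-failure behavior are my assumptions; adapt them to your script:

```python
import time
import logging
import requests

logging.basicConfig(filename='scraper.log', level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
container = []  # the product URLs collected by the first half of your script

for url3 in container:
    try:
        # A timeout turns a stalled connection into an exception instead of
        # letting it hang forever, which is the likely cause of your freeze.
        data2 = session.get(url3, headers=headers, timeout=8)
        logging.info('%s -> %s', url3, data2.status_code)
        if data2.status_code != 200:
            continue  # skip blocked or rate-limited pages
        # ... parse data2.text with BeautifulSoup as before ...
    except requests.exceptions.Timeout:
        logging.warning('Timed out, skipping %s', url3)
    except requests.exceptions.RequestException as e:
        logging.error('Request failed for %s: %s', url3, e)
    # Throttle consecutive requests so the site is less likely to block you.
    time.sleep(2)
```

A timeout may also bring CTRL+C back to life: on some platforms a socket read that is blocked with no timeout cannot be interrupted, which would match the unresponsive console you describe.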