
Optimising a Python scraping script to avoid getting blocked or draining resources

I have a fairly basic Python script that scrapes a property website and stores the address and price of each listing in a CSV file. There are over 5,000 listings to go through, but my current code times out after a while (about 2,000 listings) and the console shows 302 redirects and CORS policy errors.

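The core of the loop looks roughly like this (a simplified sketch rather than the full script; the site URL, page count and CSS selectors below are placeholders):

# Simplified sketch; the URL, page count and selectors are placeholders, not the real site.
import csv
from random import randint
from time import sleep

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.example-property-site.com/for-sale?page={}"  # placeholder

with open("listings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Address", "Price"])

    for page in range(1, 450):  # placeholder page count
        response = requests.get(BASE_URL.format(page))
        soup = BeautifulSoup(response.text, "html.parser")

        # each listing card holds an address and a price
        for card in soup.select("div.listing"):
            address = card.select_one("p.address").get_text(strip=True)
            price = card.select_one("span.price").get_text(strip=True)
            writer.writerow([address, price])

        sleep(randint(1, 5))  # random pause between page requests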

As you can see, I added sleep(randint(1, 5)) to put random intervals between requests, but I probably need to do more. Of course I want to scrape the site in its entirety as quickly as possible, but I also want to be respectful of the site being scraped and minimise the burden on it.

Can anyone suggest updates? PS: forgive any rookie errors, I'm very new to Python and scraping!


Answer

This is one way of getting that data. Bear in mind there are only 251 pages, with 12 properties on each of them (around 3,000 listings in total, not over 5k):

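A sketch of that approach, assuming the listing pages are reachable through a page query parameter and that each property card exposes its address and price via CSS classes (the URL, headers and selectors below are placeholders):

# Sketch of the paginated approach; URL, headers and selectors are placeholders.
import requests
import pandas as pd
from bs4 import BeautifulSoup
from tqdm import tqdm

headers = {"User-Agent": "Mozilla/5.0"}  # send a browser-like User-Agent
big_list = []  # one (address, price) pair per property

with requests.Session() as session:
    session.headers.update(headers)
    for page in tqdm(range(1, 252)):  # 251 listing pages, 12 properties each
        url = f"https://www.example-property-site.com/for-sale?page={page}"  # placeholder
        response = session.get(url, timeout=30)
        soup = BeautifulSoup(response.text, "html.parser")
        for card in soup.select("div.property-card"):  # placeholder selector
            address = card.select_one("p.address").get_text(strip=True)
            price = card.select_one("span.price").get_text(strip=True)
            big_list.append((address, price))

df = pd.DataFrame(big_list, columns=["Address", "Price"])
df.to_csv("properties.csv", index=False)
print(df)

Reusing one Session keeps the connection alive and sends the same headers on every request; a realistic User-Agent is often enough to stop a site from redirecting (302) a bare scraper, and tqdm simply adds a progress bar over the 251 pages.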

The resulting table of addresses and prices is printed in the terminal.


See the relevant documentation for Requests: https://requests.readthedocs.io/en/latest/

For Pandas: https://pandas.pydata.org/docs/

For BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/

And for tqdm: https://pypi.org/project/tqdm/
