I have a fairly basic Python script that scrapes a property website, and stores the address and price in a csv file. There are over 5000 listings to go through but I find my current code times out after a while (about 2000 listings) and the console shows 302 and CORS policy errors.
```python
import requests
import itertools
from bs4 import BeautifulSoup
from csv import writer
from random import randint
from time import sleep
from datetime import date

url = "https://www.propertypal.com/property-for-sale/northern-ireland/page-"
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}
filename = date.today().strftime("ni-listings-%Y-%m-%d.csv")

with open(filename, 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Address', 'Price']
    thewriter.writerow(header)

    # for page in range(1, 3):
    for page in itertools.count(1):
        req = requests.get(f"{url}{page}", headers=headers)
        soup = BeautifulSoup(req.content, 'html.parser')

        for li in soup.find_all('li', class_="pp-property-box"):
            title = li.find('h2').text
            price = li.find('p', class_="pp-property-price").text
            info = [title, price]
            thewriter.writerow(info)
        sleep(randint(1, 5))

# this script scrapes all pages and records all listings and their prices in daily csv
```
As you can see, I added `sleep(randint(1, 5))` to add random intervals, but I possibly need to do more. Of course I want to scrape the page in its entirety as quickly as possible, but I also want to be respectful to the site that is being scraped and minimise burdening them.
Can anyone suggest updates? PS: please forgive any rookie errors, I'm very new to Python and scraping!
Answer
This is one way of getting that data – bear in mind there are 251 pages only, with 12 properties on each of them, not over 5k:
```python
import requests
import pandas as pd
from tqdm import tqdm
from bs4 import BeautifulSoup as bs

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    'accept': 'application/json',
    'accept-language': 'en-US,en;q=0.9',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin'
}

s = requests.Session()
s.headers.update(headers)

big_list = []
for x in tqdm(range(1, 252)):
    soup = bs(s.get(f'https://www.propertypal.com/property-for-sale/northern-ireland/page-{x}').text, 'html.parser')
    # print(soup)
    properties = soup.select('li.pp-property-box')
    for p in properties:
        name = p.select_one('h2').get_text(strip=True) if p.select_one('h2') else None
        url = 'https://www.propertypal.com' + p.select_one('a').get('href') if p.select_one('a') else None
        price = p.select_one('p.pp-property-price').get_text(strip=True) if p.select_one('p.pp-property-price') else None
        big_list.append((name, price, url))

big_df = pd.DataFrame(big_list, columns = ['Property', 'Price', 'Url'])
print(big_df)
```
Result printed in terminal:
```
100% 251/251 [03:41<00:00, 1.38it/s]
      Property                                              Price                   Url
0     22 Erinvale Gardens, Belfast, BT10 0FS                Asking price£165,000    https://www.propertypal.com/22-erinvale-gardens-belfast/777820
1     Laurel Hill, 37 Station Road, Saintfield, BT24 7DZ    Guide price£725,000     https://www.propertypal.com/laurel-hill-37-station-road-saintfield/751274
2     19 Carrick Brae, Burren Warrenpoint, Newry, BT34 3TH  Guide price£265,000     https://www.propertypal.com/19-carrick-brae-burren-warrenpoint-newry/775302
3     7b Conway Street, Lisburn, BT27 4AD                   Offers around£299,950   https://www.propertypal.com/7b-conway-street-lisburn/779833
4     Hartley Hall, Greenisland                             From£280,000to£397,500  https://www.propertypal.com/hartley-hall-greenisland/d850
...   ...                                                   ...                     ...
3007  8 Shimna Close, Newtownards, BT23 4PE                 Offers around£99,950    https://www.propertypal.com/8-shimna-close-newtownards/756825
3008  7 Barronstown Road, Dromore, BT25 1NT                 Guide price£380,000     https://www.propertypal.com/7-barronstown-road-dromore/756539
3009  39 Tamlough Road, Randalstown, BT41 3DP               Offers around£425,000   https://www.propertypal.com/39-tamlough-road-randalstown/753299
3010  Glengeen House, 17 Carnalea Road, Fintona, BT78 2BY   Offers over£180,000     https://www.propertypal.com/glengeen-house-17-carnalea-road-fintona/750105
3011  Walnut Road, Larne, BT40 2WE                          Offers around£169,950   https://www.propertypal.com/walnut-road-larne/749733

3012 rows × 3 columns
```
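The Price values in that output run the label and the amount together (for example "Asking price£165,000"), most likely because `get_text(strip=True)` joins the text of nested tags with no separator. If a separate label and numeric amount are wanted, something along these lines might work as a post-processing step. This is only a sketch using the standard library: `split_price` and `AMOUNT_RE` are hypothetical names, and the pattern assumes every amount is written with a pound sign and comma separators, as in the sample rows.

```python
import re

# Assumption: amounts always look like "£165,000" (pound sign, digits, commas).
AMOUNT_RE = re.compile(r"£(\d[\d,]*)")


def split_price(text):
    """Split a fused price string into (label, list of integer amounts)."""
    if not text:
        return None, []
    # Collect every amount, so ranges such as "From£X to£Y" yield two values.
    amounts = [int(m.replace(",", "")) for m in AMOUNT_RE.findall(text)]
    # Everything before the first pound sign is treated as the label.
    label = text.split("£", 1)[0].strip() or None
    return label, amounts


print(split_price("Asking price£165,000"))    # ('Asking price', [165000])
print(split_price("From£280,000to£397,500"))  # ('From', [280000, 397500])
```

With the DataFrame from the answer, this could be applied as `big_df['Price'].map(split_price)`. Rows with no scraped price (`None`) come back as `(None, [])`.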
See relevant documentation for Requests: https://requests.readthedocs.io/en/latest/
For Pandas: https://pandas.pydata.org/docs/
For BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/
And for TQDM: https://pypi.org/project/tqdm/
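On the question's worry about timeouts and about not burdening the site: one common pattern is to retry a failed request with an increasing delay, pause briefly between pages, and stop at the first page that yields no listings instead of hard-coding the page count. The sketch below shows the shape of that pattern and is not tested against the site. `fetch_with_backoff` and `crawl` are hypothetical helpers, `fetch` and `parse` are callables you supply, `fetch` is assumed to raise an exception on failure, and it is an unverified assumption that a page past the last one contains no `li.pp-property-box` items (the site may redirect instead).

```python
import random
import time


def fetch_with_backoff(fetch, url, max_tries=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url); on an exception wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_tries):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of retries: let the caller see the error
            sleep(base_delay * 2 ** attempt + random.uniform(0, 1))


def crawl(fetch, parse, base_url, pause=(1.0, 3.0), sleep=time.sleep):
    """Yield rows page by page, stopping at the first page with no rows."""
    page = 1
    while True:
        rows = parse(fetch_with_backoff(fetch, f"{base_url}{page}", sleep=sleep))
        if not rows:
            break  # assumption: an empty page means we are past the last one
        yield from rows
        sleep(random.uniform(*pause))  # polite pause between pages
        page += 1


# Offline demo with stand-ins for the network and the parser (both hypothetical).
pages = {"x1": ["a", "b"], "x2": ["c"], "x3": []}
rows = list(crawl(pages.get, lambda body: body, "x", sleep=lambda s: None))
print(rows)  # ['a', 'b', 'c']
```

With Requests, `fetch` could be a small function that calls `s.get(url, timeout=10)`, then `raise_for_status()` on the response, and returns its `.text`; `parse` would be the per-page loop from the answer, returning a list of tuples. Passing `sleep` as a parameter is only there so the delays can be skipped in a dry run.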