I am quite new to Python and have been trying to scrape some websites. A few of them worked well, but I have now stumbled upon one that is giving me a hard time. The URL I am using is: https://www.drankdozijn.nl/groep/rum. I am trying to get all product titles and URLs from this page, but since there is a ::before in the HTML code I am unable to scrape it. Any help would be very appreciated! This is the code I have so far:
try:
    source = requests.get(url)
    source.raise_for_status()
    soup = BeautifulSoup(source.text, 'html.parser')
    wachttijd = random.randint(2, 4)
    print("Succes! URL:", url, "Wachttijd is:", wachttijd, "seconden")
    productlist = soup.find('div', {'id': 'app'})
    for productinfo in productlist:
        productnaam = getTextFromHTMLItem(productinfo.find('h3', {'class': 'card-title lvl3'}))
        product_url = getHREFFromHTMLItem(productinfo.find('a', {'class': 'ptile-v2_link'}))
        # print info
        print(productnaam)
        # place the information in a sheet row
        print("Sheet append")
        sheet.append([productnaam])
        # time.sleep(1)
        time.sleep(wachttijd)
    print("Sheet opslaan")
    excel.save('C:/Python/Files/RumUrlsDrankdozijn.xlsx')
    return soup
except Exception as e:
    print(e)
Answer
The product details for that site are returned as JSON from a different URL; the HTML that is returned does not contain the product data. The JSON can easily be accessed as follows:
import requests
import openpyxl

url = "https://es-api.drankdozijn.nl/products"

params = {
    "country": "NL",
    "language": "nl",
    "page_template": "groep",
    "group": "rum",
    "page": "1",
    "listLength": "20",
    "clientFilters": "{}",
    "response": "paginated",
    "sorteerOp": "relevance",
    "ascdesc": "asc",
    "onlyAvail": "false",
    "cacheKey": "1",
    "premiumMember": "N",
}

# Prepare the output spreadsheet
wb = openpyxl.Workbook()
ws = wb.active
ws.append(['Description', 'Price', 'URL', "Land", "AlcoholPercentage"])

# Request the first 10 pages of products from the JSON API
for page in range(1, 11):
    params['page'] = page
    req = requests.get(url, params=params)
    req.raise_for_status()
    data = req.json()

    for product in data['data']:
        # Build a lookup of the product's features, e.g. country of origin and alcohol percentage
        features = {feature["alias"]: feature["value"]["description"] for feature in product['features']}
        ws.append([
            product["description"],
            product["pricePerLiterFormatted"],
            product["structuredData"]["offers"]["url"],
            features.get("land", "unknown"),
            features.get("alcoholpercentage", "unknown"),
        ])

wb.save('output.xlsx')
This gets the first 10 pages of product details.
I recommend you print(data) to have a look at all of the information that is available.
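For example, a quick way to inspect the structure is to pretty-print the response and list the keys of a single product. This is just a sketch, assuming the same data dict and "data" list of product dicts used in the code above:

import json

# Pretty-print the full JSON response to see every available field
print(json.dumps(data, indent=2))

# Or just list the keys of the first product entry
print(list(data['data'][0].keys()))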
The URL was found by using the browser's network tools to watch the requests the page made whilst loading. An alternative approach would be to use something like Selenium to fully render the HTML, but that would be slower and more resource intensive.
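If you did want to go the Selenium route, a minimal sketch might look like the following. The CSS classes are taken from your original HTML-scraping attempt and are assumptions; they may change over time, and it is also assumed here that titles and links appear in the same order on the page:

from selenium import webdriver
from selenium.webdriver.common.by import By

url = "https://www.drankdozijn.nl/groep/rum"

driver = webdriver.Chrome()  # assumes a working Chrome/chromedriver setup
try:
    driver.get(url)
    driver.implicitly_wait(10)  # give the JavaScript time to render the product tiles

    # Selectors borrowed from the question's attempt; adjust if the site markup differs
    titles = [el.text for el in driver.find_elements(By.CSS_SELECTOR, "h3.card-title.lvl3")]
    links = [el.get_attribute("href") for el in driver.find_elements(By.CSS_SELECTOR, "a.ptile-v2_link")]

    for title, link in zip(titles, links):
        print(title, link)
finally:
    driver.quit()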
openpyxl is used to create the output spreadsheet. You could modify the column widths and appearance of the Excel output if needed.
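For example, a small sketch of tweaking the output before saving (the column letters and widths here are just illustrative, reusing the ws and wb objects from the code above):

from openpyxl.styles import Font

# Widen the description and URL columns and bold the header row
ws.column_dimensions['A'].width = 50
ws.column_dimensions['C'].width = 60
for cell in ws[1]:
    cell.font = Font(bold=True)

wb.save('output.xlsx')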