I am quite new to Python and have tried scraping some websites. A few of them worked well, but I have now stumbled upon one that is giving me a hard time. The URL I am using is: https://www.drankdozijn.nl/groep/rum. I am trying to get all product titles and URLs from this page, but since there is a ::before in the HTML code I am unable to scrape it. Any help would be very appreciated! This is the code I have so far:
try:
    source = requests.get(url)
    source.raise_for_status()
    soup = BeautifulSoup(source.text, 'html.parser')
    wachttijd = random.randint(2, 4)
    print("Succes! URL:", url, "Wachttijd is:", wachttijd, "seconden")
    productlist = soup.find('div', {'id': 'app'})
    for productinfo in productlist:
        productnaam = getTextFromHTMLItem(productinfo.find('h3', {'class': 'card-title lvl3'}))
        product_url = getHREFFromHTMLItem(productinfo.find('a', {'class': 'ptile-v2_link'}))
        # print info
        print(productnaam)
        # Place the information in a sheet row
        print("Sheet append")
        sheet.append([productnaam])
        #time.sleep(1)
        time.sleep(wachttijd)
    print("Sheet opslaan")
    excel.save('C:/Python/Files/RumUrlsDrankdozijn.xlsx')
    return soup
except Exception as e:
    print(e)
Answer
The product details for that site are returned from a different URL as JSON; the HTML returned for the page does not contain them. The data can easily be accessed as follows:
import requests
import openpyxl

url = "https://es-api.drankdozijn.nl/products"

params = {
    "country": "NL",
    "language": "nl",
    "page_template": "groep",
    "group": "rum",
    "page": "1",
    "listLength": "20",
    "clientFilters": "{}",
    "response": "paginated",
    "sorteerOp": "relevance",
    "ascdesc": "asc",
    "onlyAvail": "false",
    "cacheKey": "1",
    "premiumMember": "N",
}

wb = openpyxl.Workbook()
ws = wb.active
ws.append(['Description', 'Price per liter', 'URL', 'Land', 'AlcoholPercentage'])

for page in range(1, 11):
    params['page'] = page
    req = requests.get(url, params=params)
    req.raise_for_status()
    data = req.json()

    for product in data['data']:
        # Build a lookup of feature alias -> description (e.g. land, alcoholpercentage)
        features = {feature["alias"]: feature["value"]["description"] for feature in product['features']}
        ws.append([
            product["description"],
            product["pricePerLiterFormatted"],
            product["structuredData"]["offers"]["url"],
            features.get("land", "unknown"),
            features.get("alcoholpercentage", "unknown"),
        ])

wb.save('output.xlsx')
This gets the first 10 pages of details.
I recommend you print(data) to have a look at all of the information that is available.
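For example, a quick way to inspect the structure is to pretty-print a single product record. The sketch below reuses the url and params defined above; the exact keys present are simply whatever the API returns:

import json
import requests

# Fetch one page and inspect the JSON structure
# (reuses the url and params dict from the code above).
req = requests.get(url, params=params)
req.raise_for_status()
data = req.json()

print(list(data.keys()))                                          # top-level keys of the response
print(json.dumps(data['data'][0], indent=2, ensure_ascii=False))  # first product, pretty-printed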
The URL was found using the browser’s network tools to watch the request it made whilst loading the page. An alternative approach would be to use something like Selenium to fully render the HTML, but this will be slower and more resource intensive.
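For completeness, a minimal Selenium sketch is shown below. The CSS class names are taken from the question's own code and are assumptions about the rendered markup, so they may need adjusting after inspecting the page:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")      # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get("https://www.drankdozijn.nl/groep/rum")
time.sleep(5)                               # crude wait for the JavaScript to render; WebDriverWait would be more robust

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# Class names below come from the question and may differ in the rendered page.
for tile in soup.select("a.ptile-v2_link"):
    title = tile.find("h3", class_="card-title")
    print(title.get_text(strip=True) if title else "?", tile.get("href"))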
openpyxl is used to create the output spreadsheet. You could modify the column widths and appearance of the Excel output if needed.
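For example, a small optional tweak (run before wb.save, assuming the ws object from the code above; the widths are arbitrary choices) could widen the columns and bold the header row:

from openpyxl.styles import Font

# Widen columns A-E and bold the header row; widths are arbitrary illustrative values.
for col, width in zip("ABCDE", (40, 18, 60, 15, 20)):
    ws.column_dimensions[col].width = width
for cell in ws[1]:                 # row 1 holds the header cells
    cell.font = Font(bold=True)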