
Web scraping content of ::before using BeautifulSoup?

I am quite new to Python and have tried scraping some websites. A few of them worked well, but I have now stumbled upon one that is giving me a hard time. The URL I am using is: https://www.drankdozijn.nl/groep/rum. I am trying to get all product titles and URLs from this page, but since there is a ::before pseudo-element in the HTML code I am unable to scrape it. Any help would be very much appreciated! This is the code I have so far:

try:
    source = requests.get(url)
    source.raise_for_status()
    soup = BeautifulSoup(source.text, 'html.parser')
    wachttijd = random.randint(2, 4)  # wait time in seconds
    print("Succes! URL:", url, "Wachttijd is:", wachttijd, "seconden")

    productlist = soup.find('div', {'id': 'app'})
    for productinfo in productlist:
        productnaam = getTextFromHTMLItem(productinfo.find('h3', {'class': 'card-title lvl3'}))
        product_url = getHREFFromHTMLItem(productinfo.find('a', {'class': 'ptile-v2_link'}))

        # print info and place it in a sheet row
        print(productnaam)
        print("Sheet append")
        sheet.append([productnaam])

    time.sleep(wachttijd)
    print("Sheet opslaan")  # save the sheet
    excel.save('C:/Python/Files/RumUrlsDrankdozijn.xlsx')
    return soup

except Exception as e:
    print(e)


Answer

The product details for that site are returned as JSON from a different URL; the HTML you downloaded does not contain them. The data can easily be accessed as follows:

import requests
import openpyxl

url = "https://es-api.drankdozijn.nl/products"

params = {
    "country": "NL",
    "language": "nl",
    "page_template": "groep",
    "group": "rum",
    "page": "1",
    "listLength": "20",
    "clientFilters": "{}",
    "response": "paginated",
    "sorteerOp": "relevance",
    "ascdesc": "asc",
    "onlyAvail": "false",
    "cacheKey": "1",
    "premiumMember": "N",
}

wb = openpyxl.Workbook()
ws = wb.active
ws.append(['Description', 'Price', 'URL', 'Land', 'AlcoholPercentage'])

for page in range(1, 11):
    params['page'] = page
    req = requests.get(url, params=params)
    req.raise_for_status()
    data = req.json()   # the response is JSON, so no HTML parsing is needed

    for product in data['data']:
        # Build an alias -> description lookup for the product's features;
        # fall back to "unknown" for products that lack a given feature.
        features = {feature["alias"]: feature["value"]["description"] for feature in product['features']}

        ws.append([
            product["description"],
            product["pricePerLiterFormatted"],
            product["structuredData"]["offers"]["url"],
            features.get("land", "unknown"),
            features.get("alcoholpercentage", "unknown"),
        ])

wb.save('output.xlsx')

This gets the first 10 pages of details, starting:

[screenshot of the first rows of the Excel output]

I recommend you print(data) to have a look at all of the information that is available.
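To see what that extraction does without hitting the API, you can run the same lookup logic on a hand-made product entry. The field names below are the ones used in the code above; the values themselves are made up for illustration:

```python
# A made-up product entry mimicking the shape the API appears to return;
# only the fields used in the answer's code are included.
sample_product = {
    "description": "Example Rum 70CL",
    "pricePerLiterFormatted": "€ 28,50",
    "structuredData": {"offers": {"url": "https://www.drankdozijn.nl/artikel/example-rum"}},
    "features": [
        {"alias": "land", "value": {"description": "Cuba"}},
        {"alias": "alcoholpercentage", "value": {"description": "40%"}},
    ],
}

# Same lookup as above: alias -> description, with a default for
# features a product does not have.
features = {f["alias"]: f["value"]["description"] for f in sample_product["features"]}

row = [
    sample_product["description"],
    sample_product["pricePerLiterFormatted"],
    sample_product["structuredData"]["offers"]["url"],
    features.get("land", "unknown"),
    features.get("alcoholpercentage", "unknown"),
]
print(row)
```

Using .get() with a default means a product missing a feature still produces a complete row instead of raising a KeyError.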

The URL was found using the browser’s network tools to watch the request it made whilst loading the page. An alternative approach would be to use something like Selenium to fully render the HTML, but this will be slower and more resource intensive.

openpyxl is used to create the output spreadsheet. You could modify the column widths and appearance of the Excel output if needed.
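For example, a few openpyxl calls can widen the columns so the URLs stay readable and bold the header row (a sketch; the widths chosen here are arbitrary):

```python
import openpyxl
from openpyxl.styles import Font
from openpyxl.utils import get_column_letter

wb = openpyxl.Workbook()
ws = wb.active
ws.append(['Description', 'Price', 'URL', 'Land', 'AlcoholPercentage'])

# Widen each column; the widths are arbitrary, pick what suits your data.
for col, width in zip(range(1, 6), (40, 12, 60, 15, 18)):
    ws.column_dimensions[get_column_letter(col)].width = width

# Bold the header row.
for cell in ws[1]:
    cell.font = Font(bold=True)

wb.save('output.xlsx')
```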

User contributions licensed under: CC BY-SA