I am quite new to Python and have been trying to scrape some websites. A few of them worked well, but I have now stumbled upon one that is giving me a hard time. The URL I am using is: https://www.drankdozijn.nl/groep/rum. I am trying to get all product titles and URLs from this page, but since there is a ::before in the HTML code I am unable to scrape it. Any help would be very appreciated! This is the code I have so far:
try:
    source = requests.get(url)
    source.raise_for_status()
    soup = BeautifulSoup(source.text, 'html.parser')
    wachttijd = random.randint(2, 4)
    print("Succes! URL:", url, "Wachttijd is:", wachttijd, "seconden")
    productlist = soup.find('div', {'id': 'app'})
    for productinfo in productlist:
        productnaam = getTextFromHTMLItem(productinfo.find('h3', {'class': 'card-title lvl3'}))
        product_url = getHREFFromHTMLItem(productinfo.find('a', {'class': 'ptile-v2_link'}))
        # print info
        print(productnaam)
        # place the information in a sheet row
        print("Sheet append")
        sheet.append([productnaam])
        # time.sleep(1)
        time.sleep(wachttijd)
    print("Sheet opslaan")
    excel.save('C:/Python/Files/RumUrlsDrankdozijn.xlsx')
    return soup
except Exception as e:
    print(e)
Answer
The product details for that site are returned as JSON from a different URL; the HTML that is returned does not contain the product data. The JSON can easily be accessed as follows:
import requests
import openpyxl

url = "https://es-api.drankdozijn.nl/products"

params = {
    "country": "NL",
    "language": "nl",
    "page_template": "groep",
    "group": "rum",
    "page": "1",
    "listLength": "20",
    "clientFilters": "{}",
    "response": "paginated",
    "sorteerOp": "relevance",
    "ascdesc": "asc",
    "onlyAvail": "false",
    "cacheKey": "1",
    "premiumMember": "N",
}

# Prepare the output spreadsheet
wb = openpyxl.Workbook()
ws = wb.active
ws.append(['Description', 'Price', 'URL', "Land", "AlcoholPercentage"])

# Request the first 10 pages of products from the JSON API
for page in range(1, 11):
    params['page'] = page
    req = requests.get(url, params=params)
    req.raise_for_status()
    data = req.json()

    for product in data['data']:
        # Build a lookup of the product's features, e.g. country of origin and alcohol percentage
        features = {feature["alias"]: feature["value"]["description"] for feature in product['features']}
        ws.append([
            product["description"],
            product["pricePerLiterFormatted"],
            product["structuredData"]["offers"]["url"],
            features.get("land", "unknown"),
            features.get("alcoholpercentage", "unknown"),
        ])

wb.save('output.xlsx')
This gets the first 10 pages of product details.
I recommend you print(data) to have a look at all of the information that is available.
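For example, a quick way to inspect the structure is to pretty-print the response and list the keys of a single product. This is just a sketch, assuming the same data dict and "data" list of product dicts used in the code above:

import json

# Pretty-print the full JSON response to see every available field
print(json.dumps(data, indent=2))

# Or just list the keys of the first product entry
print(list(data['data'][0].keys()))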
The URL was found by using the browser's network tools to watch the requests the page made whilst loading. An alternative approach would be to use something like Selenium to fully render the HTML, but that would be slower and more resource intensive.
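If you did want to go the Selenium route, a minimal sketch might look like the following. The CSS classes are taken from your original HTML-scraping attempt and are assumptions; they may change over time, and it is also assumed here that titles and links appear in the same order on the page:

from selenium import webdriver
from selenium.webdriver.common.by import By

url = "https://www.drankdozijn.nl/groep/rum"

driver = webdriver.Chrome()  # assumes a working Chrome/chromedriver setup
try:
    driver.get(url)
    driver.implicitly_wait(10)  # give the JavaScript time to render the product tiles

    # Selectors borrowed from the question's attempt; adjust if the site markup differs
    titles = [el.text for el in driver.find_elements(By.CSS_SELECTOR, "h3.card-title.lvl3")]
    links = [el.get_attribute("href") for el in driver.find_elements(By.CSS_SELECTOR, "a.ptile-v2_link")]

    for title, link in zip(titles, links):
        print(title, link)
finally:
    driver.quit()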
openpyxl is used to create the output spreadsheet. You could modify the column widths and appearance of the Excel output if needed.
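For example, a small sketch of tweaking the output before saving (the column letters and widths here are just illustrative, reusing the ws and wb objects from the code above):

from openpyxl.styles import Font

# Widen the description and URL columns and bold the header row
ws.column_dimensions['A'].width = 50
ws.column_dimensions['C'].width = 60
for cell in ws[1]:
    cell.font = Font(bold=True)

wb.save('output.xlsx')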