Web scrape a product website like Thingiverse

I am very new to web scraping and I am trying a small project where I scrape a website like Thingiverse or similar, where different CAD (or similar) files are shown. For a particular search keyword, I am trying to obtain a list of all the results. When I inspect the website, the different products are highlighted in this part of the HTML:

<div class="SearchResult__searchResultItem--c4VZk">

However, when I go into my script and type:

29/11 Edited:

I get an empty list (length 0). What am I doing wrong?

Answer

For the original question:

The class is passed as a dictionary item. Therefore, change the code to soup.find_all('div', {'class': 'SearchResult__searchResultItem--c4VZk'})
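As a minimal sketch of the dictionary form, assuming a small made-up HTML snippet (only the class name comes from the question; the links and titles are invented for illustration) and the built-in "html.parser" instead of "lxml":

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; only the class name comes from the question.
html = """
<div class="SearchResult__searchResultItem--c4VZk"><a href="/thing:1">Gear</a></div>
<div class="SearchResult__searchResultItem--c4VZk"><a href="/thing:2">Bracket</a></div>
<div class="Other">Not a result</div>
"""

# "html.parser" ships with Python; "lxml" can be used instead if it is installed.
soup = BeautifulSoup(html, "html.parser")

# The attribute filter is a dictionary: attribute name -> expected value.
results = soup.find_all("div", {"class": "SearchResult__searchResultItem--c4VZk"})

# Pull the link text out of each matching div.
titles = [div.a.get_text() for div in results]
print(len(results), titles)
```

Only the two divs carrying the result class are matched; the third div is ignored.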

This demo shows BeautifulSoup scraping the HTML:

Output:

For your edited question:

BeautifulSoup(page, "lxml") passes in your response object rather than the HTML itself. The response object also carries the HTTP status, headers and other information. To get the HTML, call html = page.read() and pass html to BeautifulSoup.
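To see the difference between the response object and its body, here is a small sketch. It assumes a data: URL as an offline stand-in for the real page (the markup in it is made up), so it runs without network access:

```python
from urllib.parse import quote
from urllib.request import urlopen

# A data: URL stands in for the real page so the example runs offline.
# The markup is made up for illustration.
url = "data:text/html," + quote("<div class='item'>hello</div>")

page = urlopen(url)                          # response object, not a string
content_type = page.headers["Content-type"]  # metadata lives on the response
html = page.read()                           # the body itself, as bytes

print(type(page).__name__, content_type)
print(html)
```

The variable html is what you would hand to BeautifulSoup; page is only the wrapper around it.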

The website loads its HTML elements via JavaScript, so urllib.request and BeautifulSoup alone will not be able to extract the data: they only see the initial HTML, before any scripts have run. You can confirm this by printing the parsed HTML with print(soup.prettify()) and checking that the result divs are missing. To get around this issue, you can use a browser automation tool such as Selenium.
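A rough sketch of the Selenium route, under several assumptions: Chrome and a matching driver are available, the result divs still use a class starting with SearchResult__searchResultItem (the hashed suffix may change), and the helper names fetch_rendered_html and parse_results are made up for this example:

```python
import re
from bs4 import BeautifulSoup

# Prefix of the class name from the question. The hashed suffix ("--c4VZk")
# may differ between site builds, so only the prefix is matched (assumption).
RESULT_CLASS = re.compile(r"^SearchResult__searchResultItem")


def fetch_rendered_html(url, timeout=10):
    """Load `url` in a real browser and return the HTML after scripts have run."""
    # Imported here so parse_results below can be used without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes Chrome and a matching driver
    try:
        driver.get(url)
        # Wait until at least one result div has been added by JavaScript.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "div[class*='SearchResult__searchResultItem']")
            )
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on timeout


def parse_results(html):
    """Return the text of every search-result div found in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    divs = soup.find_all("div", {"class": RESULT_CLASS})
    return [div.get_text(strip=True) for div in divs]
```

With these helpers, parse_results(fetch_rendered_html(search_url)) would return the text of each result, where search_url is the search page you opened in your browser. Keeping the parsing separate from the browser step also lets you test it against saved HTML.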

Had the website returned the HTML as expected, the scrape code would look something like:

User contributions licensed under: CC BY-SA