Web scraping a product website like Thingiverse

I am very new to web scraping and am working on a small project to scrape a website like Thingiverse, or a similar site that lists CAD (or similar) files. For a particular search keyword, I want to obtain a list of all the results. When I inspect the page, each product is highlighted in this part of the HTML:

<div class="SearchResult__searchResultItem--c4VZk">

However, when I run the following in my script:

29/11 Edited:

from urllib.request import urlopen
from bs4 import BeautifulSoup
page = urlopen("https://www.thingiverse.com/search?q=vader&type=things&sort=relevant")
soup = BeautifulSoup(page, "lxml")

product_list = soup.find_all({ 'class' :'SearchResult__searchResultItem--c4VZk'})

I get an empty list. What am I doing wrong?

Answer

For the original question:

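The class is passed to find_all as an attribute dictionary, but in your call the dictionary is the first positional argument, which find_all treats as the tag name filter rather than as attributes. Pass the tag name first and the dictionary second: soup.find_all('div', { 'class' :'SearchResult__searchResultItem--c4VZk'})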

This demo shows BeautifulSoup parsing a sample HTML string:

from bs4 import BeautifulSoup

html = '''<div class="SearchResult__searchResultItem--c4VZk">Test</div>'''
soup = BeautifulSoup(html, 'html.parser')
result_list = soup.find_all('div', {'class': 'SearchResult__searchResultItem--c4VZk'})
print(result_list)

Output:

[<div class="SearchResult__searchResultItem--c4VZk">Test</div>]
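The same lookup can be written in a few other ways. The sketch below is only an illustration: the sample markup (including the extra "other" class and the span) is made up, and the prefix match assumes the part of the class name before the "--" suffix is the stable part, which is a guess about how the site generates its class names.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample markup; only the class name comes from the question.
html = '''
<div class="SearchResult__searchResultItem--c4VZk">First</div>
<div class="SearchResult__searchResultItem--c4VZk other">Second</div>
<span class="SearchResult__searchResultItem--c4VZk">Not a div</span>
'''
soup = BeautifulSoup(html, "html.parser")

# 1. Keyword form: class_ is used because "class" is a reserved word in Python.
by_keyword = soup.find_all("div", class_="SearchResult__searchResultItem--c4VZk")

# 2. CSS selector form.
by_selector = soup.select("div.SearchResult__searchResultItem--c4VZk")

# 3. Prefix match with a regular expression, in case the hashed suffix changes
#    (assumption: the part before "--" stays the same).
by_prefix = soup.find_all("div", class_=re.compile(r"^SearchResult__searchResultItem"))

print([div.get_text() for div in by_keyword])  # ['First', 'Second']
```

All three return the two div elements and skip the span, because the tag name is given explicitly in each call.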

For your edited question:

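BeautifulSoup(page, "lxml") passes in the response object rather than an HTML string. The response object also carries the HTTP status, headers and other information. BeautifulSoup can read from a file-like object such as this one, so this is not the cause of the empty list, but reading the body explicitly with html = page.read() makes it clear what is being parsed.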
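As a quick check of how the two forms compare, the sketch below parses the same markup once from a file-like object and once from the bytes returned by read(). io.BytesIO and the sample markup are stand-ins assumed here in place of a real HTTP response, so nothing is fetched.

```python
import io
from bs4 import BeautifulSoup

# Stand-in for the body of an HTTP response (assumed sample markup).
raw = b'<div class="item">Test</div>'

# Case 1: hand BeautifulSoup a file-like object that has a read() method.
fake_response = io.BytesIO(raw)
soup_from_object = BeautifulSoup(fake_response, "html.parser")

# Case 2: call read() yourself and hand BeautifulSoup the bytes.
html = io.BytesIO(raw).read()
soup_from_bytes = BeautifulSoup(html, "html.parser")

# Both produce the same parsed document.
print(str(soup_from_object) == str(soup_from_bytes))  # True
```

Either way the parsed tree only contains what the server sent, which is why the missing results have to be explained by something else.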

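The website builds its search results with JavaScript after the initial page load, so the HTML that urllib.request downloads does not contain those tags and BeautifulSoup cannot extract them. You can confirm this by printing the parsed HTML with print(soup.prettify()) and searching for the class name. To get around this, you can use a browser automation tool such as Selenium, which runs the JavaScript before you read the page.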
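A minimal sketch of the Selenium route follows. It is not a tested recipe for Thingiverse: the function names extract_results and fetch_rendered_html are hypothetical, the 15 second timeout is arbitrary, and it assumes Selenium 4 with a local Chrome install and that the class name from the question is still what the site uses. The Selenium imports are placed inside the function so the parsing helper can be used without Selenium installed.

```python
from bs4 import BeautifulSoup

# Class name taken from the question; the site may have changed it since.
RESULT_CLASS = "SearchResult__searchResultItem--c4VZk"


def extract_results(html, class_name=RESULT_CLASS):
    """Return all div elements carrying class_name in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all("div", class_=class_name)


def fetch_rendered_html(url, class_name=RESULT_CLASS, timeout=15):
    """Load url in headless Chrome and return the HTML after JavaScript has run."""
    # Imported here so the rest of the module works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until at least one result element exists in the DOM;
        # raises a timeout error if none appears in time.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, f"div.{class_name}"))
        )
        return driver.page_source
    finally:
        driver.quit()
```

Usage would then be along the lines of html = fetch_rendered_html("https://www.thingiverse.com/search?q=vader&type=things&sort=relevant") followed by product_list = extract_results(html). Running extract_results on the HTML downloaded with urlopen gives a quick way to see that the static page has no matching div elements.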

Had the website returned the HTML as expected, the scraping code would look something like this:

from urllib.request import urlopen
from bs4 import BeautifulSoup

with urlopen("https://www.thingiverse.com/search?q=vader&type=things&sort=relevant") as response:
    html = response.read()  # read the body from the response object
    soup = BeautifulSoup(html, "lxml")
    print(soup.prettify())  # The result divs do not appear because they are generated by JavaScript.
    # Tag name first, attribute dictionary second
    product_list = soup.find_all('div', {'class': 'SearchResult__searchResultItem--c4VZk'})
    print(product_list)


Source: stackoverflow