I am very new to web scraping and I am trying to do a small project where I scrape a website like Thingiverse (or similar) that lists CAD (or similar) files. For a particular search keyword, I want to obtain a list of all the results. When I inspect the website, the individual products are highlighted in this part of the HTML:
<div class="SearchResult__searchResultItem--c4VZk">
However, when I go into my script and type:
29/11 Edited:
from bs4 import BeautifulSoup

page = urlopen("https://www.thingiverse.com/search?q=vader&type=things&sort=relevant")
soup = BeautifulSoup(page, "lxml")
product_list = soup.find_all({'class': 'SearchResult__searchResultItem--c4VZk'})
I get a 0 sized list. What am I doing wrong?
Answer
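For the original question:
The class filter has to be passed as the attributes dictionary, which is the second argument of find_all. When the dictionary is passed as the first argument, BeautifulSoup interprets it as a filter on the tag name, not on attributes, so nothing matches. Therefore change the code to soup.find_all('div', {'class': 'SearchResult__searchResultItem--c4VZk'})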
This demo shows BeautifulSoup finding the div in a small HTML snippet:
from bs4 import BeautifulSoup

html = '''<div class="SearchResult__searchResultItem--c4VZk">Test</div>'''
soup = BeautifulSoup(html, 'html.parser')
Result_list = soup.find_all('div', {'class': 'SearchResult__searchResultItem--c4VZk'})
print(Result_list)
Output:
[<div class="SearchResult__searchResultItem--c4VZk">Test</div>]
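For comparison, here is a small sketch that puts the failing call next to three ways of filtering by class. The class_ keyword and select() are standard BeautifulSoup features. The prefix selector at the end is my own assumption that the hashed suffix (--c4VZk) may change between site builds, so treat it as an option rather than a requirement.

```python
from bs4 import BeautifulSoup

html = '<div class="SearchResult__searchResultItem--c4VZk">Test</div>'
soup = BeautifulSoup(html, "html.parser")
cls = "SearchResult__searchResultItem--c4VZk"

# Dictionary as the first argument: used as a tag-name filter, so no div matches.
wrong = soup.find_all({"class": cls})

# Tag name first, attribute dictionary second.
by_attrs = soup.find_all("div", {"class": cls})

# Same filter with the class_ keyword (trailing underscore, since class is reserved).
by_keyword = soup.find_all("div", class_=cls)

# CSS selector matching only the class prefix (assumption: the suffix is a build hash).
by_prefix = soup.select('div[class^="SearchResult__searchResultItem"]')

print(len(wrong), len(by_attrs), len(by_keyword), len(by_prefix))
```

Any of the last three forms should work on HTML that actually contains the div; which one to use is mostly a matter of taste.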
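For your edited question:
BeautifulSoup(page, "lxml") is given the response object returned by urlopen rather than an HTML string. The response object also carries the HTTP status, headers and other information. BeautifulSoup accepts file-like objects and will read the body from it, so the call can still work, but reading the body explicitly with html = page.read() makes it clear what is being parsed.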
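A minimal sketch of reading the body explicitly, using only the standard library. The helper name fetch_html and the UTF-8 fallback are my own choices for illustration, not something from the question.

```python
from urllib.request import urlopen


def fetch_html(url):
    """Return the response body as text (hypothetical helper)."""
    with urlopen(url) as response:
        # The response object holds headers and status; read() returns the body as bytes.
        charset = response.headers.get_content_charset() or "utf-8"  # fallback is an assumption
        return response.read().decode(charset, errors="replace")
```

The returned string can then be passed to BeautifulSoup(html, "lxml") as before.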
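The website builds its result markup with JavaScript after the page has loaded. Therefore urllib.request / BeautifulSoup, which only see the HTML the server returns, will not be able to extract the data. You can check this by printing the HTML with print(soup.prettify()) and looking for the result class. To get around this issue you can use a web automation tool such as Selenium, which drives a real browser so the JavaScript gets to run.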
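A rough sketch of the Selenium route, under several assumptions: Chrome and a matching driver are installed, the result divs still use a class starting with SearchResult__searchResultItem, and 15 seconds is enough for the page to render. The helper names are hypothetical. count_result_items performs the same check as searching the printed HTML, so it can be run on both the static and the rendered page.

```python
from bs4 import BeautifulSoup

# Prefix of the result class; the hashed suffix is left out on purpose (assumption).
RESULT_SELECTOR = 'div[class^="SearchResult__searchResultItem"]'


def count_result_items(html):
    """Count result divs in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select(RESULT_SELECTOR))


def fetch_rendered_html(url, timeout=15):
    """Load the page in a real browser and return the HTML after JavaScript has run."""
    # Imported here so count_result_items works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes Chrome and its driver are available
    try:
        driver.get(url)
        # Wait until at least one result item is present in the page.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, RESULT_SELECTOR))
        )
        return driver.page_source
    finally:
        driver.quit()
```

With this in place, count_result_items(fetch_rendered_html(url)) on the search URL would be expected to give a non-zero count if the assumptions hold, while the same count on the HTML fetched with urlopen would be zero.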
Had the website returned the HTML as expected, the scraping code would look something like:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.thingiverse.com/search?q=vader&type=things&sort=relevant"
with urlopen(url) as response:
    # Read the response body (the HTML) from the response object.
    html = response.read()

soup = BeautifulSoup(html, "lxml")
# The result divs do not appear here because they are generated by JavaScript.
print(soup.prettify())

# Tag name first, attribute filter second.
product_list = soup.find_all('div', {'class': 'SearchResult__searchResultItem--c4VZk'})
print(product_list)