BeautifulSoup returns an empty list for valid HTML content

I’m trying to build a web scraper for a Hungarian e-commerce site, https://www.arukereso.hu.

from bs4 import BeautifulSoup as soup
import requests

#The starting values
#url = input("Paste the link of an Árukereső search here: ")
url = 'https://www.arukereso.hu/notebook-c3100/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
page_num = 1
allproducts = []

#Defining functions for better readability
def nextpage():
    further_pages = usefulsoup.find("div", class_="pagination hidden-xs")
    nextpage_num = page_num + 1
    try:
        next_page = further_pages.find("a", string=str(nextpage_num))
        next_page = next_page['href']
        return next_page
    except:
        return None

while True:
    if url == None:
        break
    r = requests.get(url, headers=headers)
    page_html = r.content
    r.close()

    soup = soup(page_html, "html.parser")
    #print(soup)
    usefulsoup = soup.find("div", id="product-list")
    #print(usefulsoup)

    products = usefulsoup.find_all("div", class_="product-box-container clearfix")
    print(products)
    for product in products:
        allproducts.append(product)

    url = nextpage()

print(allproducts)

The problem is that the first call to nextpage() returns a valid link (https://www.arukereso.hu/notebook-c3100/?start=25) and the response to that request also contains valid HTML, but BeautifulSoup turns it into an empty list, so the program ends with an error.

I would be grateful if someone could explain the reason for this and how to fix it.

Answer

The problem in your code is the following line:

soup = soup(page_html, "html.parser")

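The first time the loop runs, it works because the name soup still refers to the BeautifulSoup class (imported as soup). That assignment then rebinds soup to the parsed document, so on the next iteration soup(page_html, "html.parser") no longer builds a new parse tree: calling a BeautifulSoup object is shorthand for find_all(), which returns an empty result list here, and the following soup.find(...) call then fails on that list. Rename the variable (for example, to page_soup) and it should work. I have tested it.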
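
As a rough illustration of the rename, here is a minimal sketch of the loop with a separate variable name for the parsed page. To keep it runnable offline, it parses two small hypothetical HTML snippets (the PAGES dict, with made-up keys "page1" and "page2") instead of calling requests.get, and it assumes the pagination block sits inside the product-list div, as the original nextpage() implies. The helper next_page_url and the name page_soup are illustrative choices, not part of the original script. The sketch also increments page_num after each page, which the original script never does, so that it looks for page 3 after page 2.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in pages so the sketch runs without network access;
# the real script would fetch each URL with requests.get(url, headers=headers).
PAGES = {
    "page1": """
        <div id="product-list">
          <div class="product-box-container clearfix">Laptop A</div>
          <div class="pagination hidden-xs"><a href="page2">2</a></div>
        </div>""",
    "page2": """
        <div id="product-list">
          <div class="product-box-container clearfix">Laptop B</div>
          <div class="pagination hidden-xs"><a href="page1">1</a></div>
        </div>""",
}


def next_page_url(product_list, next_page_num):
    """Return the href of the pagination link labelled next_page_num, or None."""
    pagination = product_list.find("div", class_="pagination hidden-xs")
    if pagination is None:
        return None
    link = pagination.find("a", string=str(next_page_num))
    return link["href"] if link is not None else None


all_products = []
url = "page1"
page_num = 1

while url is not None:
    # A distinct name (page_soup) keeps BeautifulSoup callable on every pass.
    page_soup = BeautifulSoup(PAGES[url], "html.parser")
    product_list = page_soup.find("div", id="product-list")
    all_products.extend(
        product_list.find_all("div", class_="product-box-container clearfix")
    )
    url = next_page_url(product_list, page_num + 1)
    page_num += 1

print([p.get_text(strip=True) for p in all_products])
```

Importing the class under its own name (from bs4 import BeautifulSoup) rather than as soup makes this kind of accidental shadowing less likely, since a short local variable name can no longer collide with it.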
