I’m trying to build a web scraper for a Hungarian e-commerce site called https://www.arukereso.hu.
from bs4 import BeautifulSoup as soup
import requests

#The starting values
#url = input("Paste the link of an Árukereső search here: ")
url = 'https://www.arukereso.hu/notebook-c3100/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
page_num = 1
allproducts = []

#Defining functions for better readability
def nextpage():
    further_pages = usefulsoup.find("div", class_="pagination hidden-xs")
    nextpage_num = page_num + 1
    try:
        next_page = further_pages.find("a", string=str(nextpage_num))
        next_page = next_page['href']
        return next_page
    except:
        return None

while True:
    if url == None:
        break
    r = requests.get(url, headers=headers)
    page_html = r.content
    r.close()
    soup = soup(page_html, "html.parser")
    #print(soup)
    usefulsoup = soup.find("div", id="product-list")
    #print(usefulsoup)
    products = usefulsoup.find_all("div", class_="product-box-container clearfix")
    print(products)
    for product in products:
        allproducts.append(product)
    url = nextpage()

print(allproducts)
The problem is that when the nextpage() function is first called, it returns a valid link (https://www.arukereso.hu/notebook-c3100/?start=25), and the request’s content is also valid HTML, but BeautifulSoup makes an empty list out of it, so the program ends with an error.
I would be grateful if someone could explain the reason for this and how to fix it.
Answer
The problem in your code is the following line:
soup = soup(page_html, "html.parser")
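The first time the loop runs, this works because the name soup still refers to the BeautifulSoup class imported from bs4. The assignment then rebinds soup to the parsed document, so on the next iteration soup(page_html, "html.parser") calls that document object instead of the class. Calling a BeautifulSoup object is shorthand for find_all(), which is why you get an empty list, and the soup.find(...) call that follows fails on it. Rename the variable (for example to page_soup) and it should work. I have tested it.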
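As a rough illustration of the rename, here is a minimal sketch that runs without network access. The PAGES dictionary, its two tiny HTML snippets, and the helper names next_page_url and scrape are assumptions made up for this example; the real site's markup may differ, and in the real script the fetch argument would be a function wrapping requests.get. The sketch also increments the page counter on each pass, which the original loop does not do.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real pages, keyed by URL (not the site's actual HTML).
PAGES = {
    "page1": '<div id="product-list">'
             '<div class="product-box-container clearfix">A</div>'
             '<div class="pagination hidden-xs"><a href="page2">2</a></div>'
             '</div>',
    "page2": '<div id="product-list">'
             '<div class="product-box-container clearfix">B</div>'
             '<div class="pagination hidden-xs"><a href="page1">1</a></div>'
             '</div>',
}


def next_page_url(product_list, page_num):
    """Return the href of the pagination link labelled page_num + 1, or None."""
    pagination = product_list.find("div", class_="pagination hidden-xs")
    if pagination is None:
        return None
    link = pagination.find("a", string=str(page_num + 1))
    return link["href"] if link is not None else None


def scrape(fetch, url):
    """Collect product boxes from every page, following the pagination links."""
    all_products = []
    page_num = 1
    while url is not None:
        # A separate name (page_soup) keeps the BeautifulSoup class intact
        # for the next iteration.
        page_soup = BeautifulSoup(fetch(url), "html.parser")
        product_list = page_soup.find("div", id="product-list")
        all_products.extend(
            product_list.find_all("div", class_="product-box-container clearfix")
        )
        url = next_page_url(product_list, page_num)
        page_num += 1
    return all_products


products = scrape(PAGES.get, "page1")
print([p.get_text() for p in products])
```

Passing the fetch function in as a parameter is only a convenience for trying the loop offline; the part that matters for the question is that the parsed page is stored under a name other than the one used for the import.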