A website loads part of its content after the page is opened. When I use libraries such as requests and urllib3, I cannot get the part that is loaded later. How can I get the HTML of this website as it appears in the browser? I can't open a browser with Selenium to get the HTML, because this process must not be slowed down by a browser.
I tried httpx, httplib2, urllib, and urllib3, but I couldn't get the later-loaded section.
Answer
You can use Selenium to load the page the way a browser does and wait for the additional HTML elements to appear. BeautifulSoup can then parse the resulting HTML, but it does not execute JavaScript or load pages on its own.
I would suggest using Selenium since it provides the WebDriverWait class, which lets you wait for the additional HTML elements before scraping them.
This is my simple example:
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Replace with the URL of the website you want
    url = "https://www.example.com"

    # Run Chrome in headless mode (no visible browser window)
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")

    # Create the Chrome webdriver once, with the headless options applied
    driver = webdriver.Chrome(options=options)

    try:
        driver.get(url)

        # Wait up to 10 seconds for the additional HTML elements to load
        wait = WebDriverWait(driver, 10)
        wait.until(EC.presence_of_all_elements_located((By.XPATH, "//*[contains(@class, 'lazy-load')]")))

        # Get the rendered HTML
        html = driver.page_source
        print(html)
    finally:
        # quit() ends the session and shuts down the browser process
        driver.quit()
In the example above, I'm using an explicit wait that waits up to 10 seconds for a specific condition to occur. More specifically, it waits until at least one element whose class contains 'lazy-load' is located via By.XPATH, and then I retrieve the page's HTML. If no matching element appears within that time, wait.until raises a TimeoutException.
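Conceptually, an explicit wait is a polling loop: it calls a condition over and over until the condition returns something truthy or the time limit runs out. The sketch below illustrates that idea in plain Python, without Selenium. It is not Selenium's implementation; the names wait_until, WaitTimeout and find_lazy_elements are hypothetical, and the "page" is simulated by a counter.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition is not met in time (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Raises WaitTimeout if the timeout (in seconds) elapses first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulated page: the lazily loaded elements only show up on the third poll.
polls = {"count": 0}


def find_lazy_elements():
    polls["count"] += 1
    return ["<div class='lazy-load'>"] if polls["count"] >= 3 else []


elements = wait_until(find_lazy_elements, timeout=2.0, poll_interval=0.1)
print(elements)
```

The loop returns as soon as the condition succeeds, so a page that loads quickly does not cost the full timeout. This is why an explicit wait is usually preferable to a fixed sleep.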
Finally, I would recommend checking out both BeautifulSoup and Selenium, since both have tremendous capabilities for scraping websites and automating web-based tasks.
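As a rough illustration of how the two libraries can fit together: BeautifulSoup does not load pages itself, but it can parse the HTML string that Selenium returns in driver.page_source. The sketch below assumes the bs4 package is installed and uses a made-up HTML string in place of the real page source; the 'lazy-load' class and the sample markup are assumptions for demonstration only.

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source; this sample markup is invented for the example.
html = """
<html><body>
  <div class="static">Header</div>
  <div class="lazy-load item">First item</div>
  <div class="lazy-load item">Second item</div>
</body></html>
"""

# Parse with Python's built-in HTML parser (no extra dependency beyond bs4).
soup = BeautifulSoup(html, "html.parser")

# class_="lazy-load" matches any element whose class list includes that class.
lazy_texts = [el.get_text(strip=True) for el in soup.find_all(class_="lazy-load")]
print(lazy_texts)
```

In a real script, the html variable would be the string returned by Selenium after the wait has completed, so the lazily loaded elements are already present when BeautifulSoup parses it.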