
How can I get the HTML of a website as seen in the browser?

A website loads part of its content after the page is opened. When I use libraries such as requests and urllib3, I cannot get the part that is loaded later. How can I get the HTML of this website as it appears in the browser? I can’t open a browser using Selenium to get the HTML, because the browser would slow the process down.

I tried httpx, httplib2, urllib, and urllib3, but I couldn’t get the section that is loaded later.


Answer

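HTTP libraries such as the ones you tried only download the initial HTML and do not run the page’s JavaScript, so content that is loaded later never appears in the response. You can use Selenium to load the page in a real browser and wait for the additional HTML elements to appear. BeautifulSoup can then parse the resulting HTML, but it does not load or render pages by itself.

I would suggest using Selenium, since it provides the WebDriverWait class, which lets you wait for the additional HTML elements before scraping them. Running Chrome in headless mode avoids opening a visible browser window.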

This is my simple example:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Replace with the URL of the website you want
url = "https://www.example.com"

# Run Chrome in headless mode (no visible browser window)
options = webdriver.ChromeOptions()
options.add_argument("--headless")

# Create a single instance of the Chrome webdriver with these options
driver = webdriver.Chrome(options=options)

driver.get(url)

# Wait up to 10 seconds for the additional HTML elements to load
# ('lazy-load' is a placeholder; use a class that exists on your page)
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_all_elements_located((By.XPATH, "//*[contains(@class, 'lazy-load')]")))

# Get the rendered HTML
html = driver.page_source

print(html)

# Quit the browser and end the WebDriver session
driver.quit()

In the example above, I’m using an explicit wait that waits up to 10 seconds for a specific condition to occur. More specifically, I’m waiting until the elements whose class contains ‘lazy-load’ are present in the DOM, located with By.XPATH, and then I retrieve the page’s HTML. If the condition is not met within 10 seconds, the wait raises a TimeoutException.
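
To make the idea of an explicit wait more concrete, here is a minimal sketch of the kind of polling loop it relies on. This is only an illustration and not Selenium's actual implementation: the names wait_until and WaitTimeout are hypothetical, and the 0.5 second poll interval is chosen to mirror WebDriverWait's default.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        # Sleep briefly so the loop does not spin at full speed.
        time.sleep(poll_interval)
```

The point of this design is that the script continues as soon as the condition holds, rather than always sleeping for the full timeout, so a page that loads quickly is not penalized.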

Finally, I would recommend checking out both BeautifulSoup and Selenium: Selenium is very capable at automating a browser and other web-based tasks, and BeautifulSoup at parsing the HTML you scrape.
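
As a rough sketch of how the two libraries can fit together, the HTML string obtained from Selenium's driver.page_source can be handed to BeautifulSoup for parsing. The snippet below assumes the beautifulsoup4 package is installed; the helper name extract_lazy_text, the sample HTML, and the 'lazy-load' class are all placeholders for illustration, with the sample string standing in for a real page source.

```python
from bs4 import BeautifulSoup

# Stand-in for the string you would get from driver.page_source.
sample_html = """
<html><body>
  <div class="lazy-load card">First item</div>
  <p class="static">Always present</p>
  <span class="lazy-load">Second item</span>
</body></html>
"""


def extract_lazy_text(html, class_name="lazy-load"):
    """Return the text of every element that carries the given CSS class."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches elements whose class list includes class_name.
    return [el.get_text(strip=True) for el in soup.find_all(class_=class_name)]
```

Calling extract_lazy_text(sample_html) collects the text of the two elements marked with the placeholder class and skips the paragraph that lacks it.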

User contributions licensed under: CC BY-SA