I'm trying to extract data from this website (https://alliedoffsets.com/#/profile/2). It has many such projects, and I want to get the values of Estimated Average Wholesale Price and Estimated Annual Emission Reduction. When I try to print the element using Beautiful Soup, it does not return those tags and gives empty values. I know it could be a basic thing, but I'm stuck. Maybe the data is populated on the website using JavaScript, but I cannot figure out a way to get it.
import pandas as pd
import requests
from bs4 import BeautifulSoup

url='https://alliedoffsets.com/#/profile/1'
r=requests.get(url)
url=r.content
soup = BeautifulSoup(url,'html.parser')
tab=soup.find("thead",{"class":"sr-only"})
print(tab)
Answer
The web page is rendered by JavaScript in the browser, so the HTML returned by requests does not contain these elements and BeautifulSoup cannot extract them directly. Selenium can be used to load the page in a real browser, wait for it to render, and then search for elements by ID, class, XPath, etc.
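A related detail that may explain the empty result: everything after the # in the URL is a fragment, which HTTP clients keep on the client side and do not send to the server. The sketch below uses only the standard library to show what the server would be asked for; request_target is a hypothetical helper name introduced here for illustration, not part of requests or BeautifulSoup.

```python
from urllib.parse import urlsplit, urlunsplit


def request_target(url: str) -> str:
    """Return the URL without its fragment, i.e. what is requested from the server."""
    parts = urlsplit(url)
    # Rebuild the URL with an empty fragment; default the path to "/" if missing.
    return urlunsplit((parts.scheme, parts.netloc, parts.path or "/", parts.query, ""))


for n in (1, 2):
    url = f"https://alliedoffsets.com/#/profile/{n}"
    # The fragment selects the profile in the browser; the server never sees it.
    print(urlsplit(url).fragment, "->", request_target(url))
```

Both profile URLs reduce to the same request, so a plain HTTP fetch receives the same base document for every profile, and the project data has to come from a rendered page.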
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import re

url = 'https://alliedoffsets.com/#/profile/1'
s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)

# web driver goes to page
driver.get(url)

# use WebDriverWait to wait until page is rendered
# find Estimated Average Wholesale Price
elt = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'direct-price-panel'))
)
# extract just the price (a "$" followed by non-whitespace) from the text
print(re.sub(r'.*(\$\S+).*', r'\1', elt.text))

# find Estimated Annual Emission Reduction
elt = driver.find_element(By.XPATH, "//*[strong[contains(., 'Estimated Annual Emission Reduction')]]")
# keep only the value after the "Label:" prefix
print(elt.text.split(":")[1])

# close the browser when done
driver.quit()
Output:
$5.06
11603 tCO2
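If the values are needed as numbers (for example, to collect many projects into a table), the extracted text can be parsed further. This is a minimal sketch, assuming the element text looks like the sample strings below; the helper names parse_price and parse_reduction and the exact label wording are assumptions for illustration, not taken from the site.

```python
import re
from typing import Tuple


def parse_price(text: str) -> float:
    """Pull the first dollar amount out of the text and return it as a float."""
    match = re.search(r"\$(\d+(?:\.\d+)?)", text)
    if match is None:
        raise ValueError(f"no dollar amount found in {text!r}")
    return float(match.group(1))


def parse_reduction(text: str) -> Tuple[int, str]:
    """Split 'Label: 11603 tCO2' into the integer amount and its unit."""
    value = text.split(":", 1)[1].strip()
    amount, unit = value.split(maxsplit=1)
    # Drop thousands separators in case the site formats large numbers with commas.
    return int(amount.replace(",", "")), unit


# Assumed sample strings, standing in for elt.text from the Selenium code.
print(parse_price("Estimated Average Wholesale Price: $5.06"))
print(parse_reduction("Estimated Annual Emission Reduction: 11603 tCO2"))
```

Keeping the parsing separate from the browser code makes it easy to check against saved text without launching Chrome each time.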