I'm trying to extract data from this website ('https://alliedoffsets.com/#/profile/2'). It has many such projects, and I want to get the values of Estimated Average Wholesale Price and Estimated Annual Emission Reduction. When I try to print the element using Beautiful Soup, it does not return those tags and gives empty values. I know it could be a basic thing, but I'm stuck. Maybe the data is populated on the page using JavaScript, but I cannot figure out how to get it.
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://alliedoffsets.com/#/profile/1'
r = requests.get(url)
url = r.content
soup = BeautifulSoup(url, 'html.parser')

tab = soup.find("thead", {"class": "sr-only"})
print(tab)
Answer
The web page is rendered by JavaScript in the browser, so the HTML returned by requests does not contain these elements and BeautifulSoup cannot extract them directly. Selenium can be used to load the page in a real browser, wait for it to render, and then search for elements by ID, class, XPath, etc.
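One way to see why the `requests` call in the question comes back without the data: everything after `#` in the URL is a fragment, which HTTP clients do not send to the server. The following is a minimal sketch using only the standard library; `server_visible_url` is a hypothetical helper name introduced here for illustration.

```python
from urllib.parse import urldefrag


def server_visible_url(url: str) -> str:
    """Return the URL with its fragment removed, i.e. the part sent in an HTTP request."""
    return urldefrag(url).url


for profile_id in (1, 2):
    url = f"https://alliedoffsets.com/#/profile/{profile_id}"
    # The fragment selects the profile, but only in the browser.
    print(urldefrag(url).fragment, "->", server_visible_url(url))
# /profile/1 -> https://alliedoffsets.com/
# /profile/2 -> https://alliedoffsets.com/
```

Both profile URLs reduce to the same address, so the profile content has to be filled in by a browser running the page's JavaScript, which is what the Selenium script does.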
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import re

url = 'https://alliedoffsets.com/#/profile/1'

s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)

# web driver goes to page
driver.get(url)

# use WebDriverWait to wait (up to 10 seconds) until the page is rendered,
# then find Estimated Average Wholesale Price
elt = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'direct-price-panel'))
)
# extract just the price (the token starting with "$") from the text
print(re.sub(r'.*(\$\S+).*', r'\1', elt.text))

# find Estimated Annual Emission Reduction
elt = driver.find_element(By.XPATH, "//*[strong[contains(., 'Estimated Annual Emission Reduction')]]")
# keep only the value after the colon, without surrounding whitespace
print(elt.text.split(":")[1].strip())

# close the browser when done
driver.quit()
Output:
$5.06
11603 tCO2
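If the values are needed as numbers rather than printed text, the strings returned by Selenium could be cleaned up with small helpers along these lines. This is only a sketch: `parse_price` and `parse_labelled_value` are hypothetical helper names, and the sample strings are assumptions modelled on the output above, not text captured from the site.

```python
import re
from typing import Optional


def parse_price(panel_text: str) -> Optional[float]:
    """Return the first dollar amount found in the text as a float, or None."""
    match = re.search(r"\$\s*([\d,]+(?:\.\d+)?)", panel_text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return float(match.group(1).replace(",", ""))


def parse_labelled_value(text: str) -> Optional[str]:
    """Return the stripped text after the first colon, or None if there is no colon."""
    _, sep, value = text.partition(":")
    return value.strip() if sep else None


# Assumed sample strings (not captured from the site).
price_text = "Estimated Average Wholesale Price\n$5.06"
reduction_text = "Estimated Annual Emission Reduction: 11603 tCO2"

print(parse_price(price_text))
print(parse_labelled_value(reduction_text))
```

Returning `None` instead of raising makes it easier to loop over many profiles where a field may be missing; whether that is the right behaviour depends on how the results are used.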