How can I scrape an Apple HTML page using Python?

I am trying to scrape the h2 tag shown below from an Apple TV page, using the Python 3.10.6 code further down. I can see the h2 tag on the page, but my script, run from PyCharm 2022.1.4, is unable to scrape it. episode-shelf-header is a unique class in this page's HTML.

I searched for a solution but was unable to find one.

Can anyone help?

<div class="episode-shelf-header" id="{{@model.id}}-{{@shelf.id}}">
    <h2 class="typ-headline-emph">
        Season 1
    </h2>
</div>

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://tv.apple.com/us/show/life-by-ella/umc.cmc.1suiyueh1ntwjtsstcwldofno?ctx_brand=tvs.sbd.4000')

pageSource = driver.page_source
soup = BeautifulSoup(pageSource, 'html.parser')
div = soup.find('div', attrs={'class': 'episode-shelf-header'})
h2 = div.find('h2', attrs={'class': 'typ-headline-emph'})

Answer

The value can be extracted directly with Selenium; BeautifulSoup is not needed.
You must wait for the page to load fully before looking up the element.

Here is sample code that extracts the value:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://tv.apple.com/us/show/life-by-ella/umc.cmc.1suiyueh1ntwjtsstcwldofno?ctx_brand=tvs.sbd.4000')
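# The XPath matches the id exactly as it appears in the HTML from the question.
x_path = '//*[@id="{{@model.id}}-{{@shelf.id}}"]/h2'
# Poll for up to 10 seconds; WebDriverWait ignores NoSuchElementException
# by default, so the lambda is retried until the h2 is present.
element = WebDriverWait(driver, 10).until(lambda x: x.find_element(By.XPATH, x_path))

print(element.text)
driver.quit()  # close the browser when done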

Note: Selenium version used: 4.3.0
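
If you would rather keep the BeautifulSoup parsing from the question, the same wait can be placed before reading page_source. The following is only a sketch, not a tested solution for this page: the helper names extract_season_heading and fetch_rendered_html are made up for illustration, and the CSS selector assumes the class names shown in the question's HTML.

```python
from bs4 import BeautifulSoup


def extract_season_heading(html):
    """Return the h2 text inside div.episode-shelf-header, or None if it is absent."""
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', attrs={'class': 'episode-shelf-header'})
    if div is None:  # the shelf is not present in this HTML
        return None
    h2 = div.find('h2', attrs={'class': 'typ-headline-emph'})
    return h2.get_text(strip=True) if h2 is not None else None


def fetch_rendered_html(url, timeout=10):
    """Open url in Chrome, wait for the shelf header h2, then return the page source."""
    # Selenium is imported here so the parsing helper above works without a browser.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.wait import WebDriverWait
    from webdriver_manager.chrome import ChromeDriverManager

    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get(url)
        # Assumed selector, built from the class name in the question's HTML.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, 'div.episode-shelf-header h2'))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if the wait times out


# Usage (launches Chrome, so it is left commented out here):
# html = fetch_rendered_html('https://tv.apple.com/us/show/life-by-ella/umc.cmc.1suiyueh1ntwjtsstcwldofno?ctx_brand=tvs.sbd.4000')
# print(extract_season_heading(html))
```

Returning None instead of calling .find() on a missing div avoids the AttributeError that occurs when soup.find() does not match anything.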

Answer