
Can’t get tags when scraping data

I am trying to scrape all tr tags using BeautifulSoup, but it finds none. Code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.pro-football-reference.com/years/2020/defense_advanced.htm'
html = urlopen(url)
stats_page = BeautifulSoup(html, "lxml")

column_headers = stats_page.findAll('tr')[0] #Line that returns none and throws IndexError
column_headers = [i.getText() for i in column_headers.findAll('th')]

Even though there are tr tags at this URL, findAll('tr') finds none of them, so indexing with [0] throws an IndexError. Why is this happening?


Answer

In the page source, the table is located inside an HTML comment, so BeautifulSoup treats it as a comment string rather than as tags. You need to extract the comment's content and then parse it as HTML:

from urllib.request import urlopen
from bs4 import BeautifulSoup, Comment

url = 'https://www.pro-football-reference.com/years/2020/defense_advanced.htm'
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")
# Find the comment node that holds the table markup
comment = soup.find(string=lambda text: isinstance(text, Comment) and 'class="table_outer_container"' in text)
# Parse the comment's contents as HTML in their own right
stats_page = BeautifulSoup(comment, "lxml")
column_headers = stats_page.find_all('tr')[0]
column_headers = [i.get_text() for i in column_headers.find_all('th')]
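
If you want to check the idea without requesting the live page, here is a minimal sketch on a made-up snippet. The `SAMPLE_HTML` string, its column names, and the helper `headers_from_commented_table` are hypothetical and only for illustration; the snippet also assumes the built-in "html.parser" instead of "lxml" so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical miniature page: the table exists only inside an HTML comment,
# which is the situation described in the answer above.
SAMPLE_HTML = """
<html><body>
<div id="wrapper">
<!--
<div class="table_outer_container">
<table>
<tr><th>Player</th><th>Tm</th><th>Int</th></tr>
<tr><td>Example Player</td><td>XYZ</td><td>3</td></tr>
</table>
</div>
-->
</div>
</body></html>
"""


def headers_from_commented_table(html, marker='class="table_outer_container"'):
    """Return the header texts of the first row of a table hidden in a comment."""
    soup = BeautifulSoup(html, "html.parser")
    # Locate the comment node whose text contains the marker string.
    comment = soup.find(string=lambda s: isinstance(s, Comment) and marker in s)
    if comment is None:
        return []  # no matching comment, so nothing to parse
    # Parse the comment's text as a separate HTML document.
    inner = BeautifulSoup(comment, "html.parser")
    first_row = inner.find("tr")
    if first_row is None:
        return []
    return [th.get_text(strip=True) for th in first_row.find_all("th")]


# A direct search on the outer document finds no rows at all.
print(len(BeautifulSoup(SAMPLE_HTML, "html.parser").find_all("tr")))
# Searching inside the comment finds the header row.
print(headers_from_commented_table(SAMPLE_HTML))
```

Returning an empty list when the comment or row is missing avoids the IndexError from the question if the site changes its markup.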
User contributions licensed under: CC BY-SA