I am trying to scrape all tr tags using BeautifulSoup, but it finds none of them. Code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://www.pro-football-reference.com/years/2020/defense_advanced.htm'
html = urlopen(url)
stats_page = BeautifulSoup(html, "lxml")
column_headers = stats_page.findAll('tr')[0]  # findAll finds no tr tags here, so [0] throws IndexError
column_headers = [i.getText() for i in column_headers.findAll('th')]
Even though the page at this URL contains tr tags, findAll finds none of them, and indexing the empty result throws an IndexError. Why is this happening?
Answer
In the page source, the table is located inside an HTML comment, so BeautifulSoup treats it as comment text rather than as tr elements. You need to extract the comment's content and then parse it as HTML:
from urllib.request import urlopen
from bs4 import BeautifulSoup
from bs4 import Comment
url = 'https://www.pro-football-reference.com/years/2020/defense_advanced.htm'
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")
# Find the comment node whose text contains the table markup
comment = soup.find(string=lambda text: isinstance(text, Comment) and 'class="table_outer_container"' in text)
# Parse the comment's content as its own HTML document
stats_page = BeautifulSoup(comment, "lxml")
column_headers = stats_page.find_all('tr')[0]
column_headers = [i.get_text() for i in column_headers.find_all('th')]
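To see the effect without hitting the site, here is a minimal sketch that runs on an inline HTML string. The `SAMPLE_HTML` markup, its column names, and the helper `headers_from_commented_table` are made up for illustration and are not taken from the real page. It also uses the built-in "html.parser" instead of "lxml" so that it has no extra dependency; that is an assumption of the sketch, not a requirement of the approach.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup: the table is wrapped in an HTML comment,
# similar in spirit to the page in the question.
SAMPLE_HTML = """
<html><body>
<div id="wrapper">
<!--
<div class="table_outer_container">
<table>
<tr><th>Player</th><th>Tm</th><th>Int</th></tr>
<tr><td>Example Player</td><td>XYZ</td><td>3</td></tr>
</table>
</div>
-->
</div>
</body></html>
"""


def headers_from_commented_table(html):
    """Return the th texts of the first tr found inside any HTML comment."""
    soup = BeautifulSoup(html, "html.parser")
    # Walk over every comment node in the document.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # Re-parse the comment text as HTML so its tags become real elements.
        inner = BeautifulSoup(comment, "html.parser")
        first_row = inner.find("tr")
        if first_row is not None:
            return [th.get_text(strip=True) for th in first_row.find_all("th")]
    # No comment contained a table row.
    return []


if __name__ == "__main__":
    outer = BeautifulSoup(SAMPLE_HTML, "html.parser")
    # The commented-out rows are invisible to a direct search.
    print(outer.find_all("tr"))
    print(headers_from_commented_table(SAMPLE_HTML))
```

The direct search on the outer document comes back empty, which is why `[0]` fails in the question, while re-parsing the comment text exposes the header row. Returning an empty list when nothing matches also avoids the IndexError if the page layout changes.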