I am trying to scrape all tr tags using BeautifulSoup, but it returns none. Code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.pro-football-reference.com/years/2020/defense_advanced.htm'
html = urlopen(url)
stats_page = BeautifulSoup(html, "lxml")
column_headers = stats_page.findAll('tr')[0]  # Line that returns none and throws IndexError
column_headers = [i.getText() for i in column_headers.findAll('th')]
Even though there are tr tags in this url, it returns none and throws an IndexError. Why is this happening?
Answer
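In the page source, the table is located inside an HTML comment, so BeautifulSoup does not see its tr tags as part of the parsed tree. findAll('tr') therefore returns an empty list, and indexing it with [0] raises the IndexError. You need to extract the comment's content and then parse it as HTML: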
from urllib.request import urlopen
from bs4 import BeautifulSoup
from bs4 import Comment

url = 'https://www.pro-football-reference.com/years/2020/defense_advanced.htm'
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")

# Find the comment node whose text contains the table markup.
# (string= is the current name of the older text= argument.)
comment = soup.find(string=lambda text: isinstance(text, Comment) and 'class="table_outer_container"' in text)

# Parse the comment's content as HTML in its own right.
stats_page = BeautifulSoup(comment, "lxml")
column_headers = stats_page.findAll('tr')[0]
column_headers = [i.getText() for i in column_headers.findAll('th')]
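
To see the effect without hitting the site, here is a minimal offline sketch of the same technique. SAMPLE_HTML and extract_header_cells are made-up names for illustration, the markup only imitates the real page's structure, and it uses the built-in "html.parser" instead of "lxml" so nothing extra needs to be installed:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the real page: the table sits inside an HTML comment.
SAMPLE_HTML = """
<html><body>
<div id="all_defense">
<!--
<div class="table_outer_container">
<table>
<tr><th>Player</th><th>Tm</th><th>Int</th></tr>
<tr><td>A. Example</td><td>XYZ</td><td>3</td></tr>
</table>
</div>
-->
</div>
</body></html>
"""


def extract_header_cells(html):
    """Return (rows found by a plain search, header texts from the commented table)."""
    soup = BeautifulSoup(html, "html.parser")

    # A plain tag search does not look inside comments.
    visible_rows = soup.find_all("tr")

    # Locate the comment node that holds the table markup.
    comment = soup.find(
        string=lambda s: isinstance(s, Comment) and "table_outer_container" in s
    )
    if comment is None:
        return visible_rows, []

    # Re-parse the comment text as HTML and read the first row's header cells.
    inner = BeautifulSoup(comment, "html.parser")
    rows = inner.find_all("tr")
    headers = [th.get_text() for th in rows[0].find_all("th")] if rows else []
    return visible_rows, headers


visible_rows, headers = extract_header_cells(SAMPLE_HTML)
print(len(visible_rows))  # 0
print(headers)            # ['Player', 'Tm', 'Int']
```

Checking that the comment and the row list are not empty before indexing avoids the IndexError if the page layout changes.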