I am trying to scrape all tr tags using BeautifulSoup, but findAll('tr') returns an empty list. Code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.pro-football-reference.com/years/2020/defense_advanced.htm'
html = urlopen(url)
stats_page = BeautifulSoup(html, "lxml")

column_headers = stats_page.findAll('tr')[0]  # Line that returns none and throws IndexError
column_headers = [i.getText() for i in column_headers.findAll('th')]
Even though there are tr tags on this page, the search comes back empty, so indexing with [0] throws an IndexError. Why is this happening?
Answer
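In the page source, the table is located inside an HTML comment, so the parser treats it as comment text and never sees its tr tags as elements. You need to extract the comment's content and then parse it as HTML: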
from urllib.request import urlopen

from bs4 import BeautifulSoup
from bs4 import Comment

url = 'https://www.pro-football-reference.com/years/2020/defense_advanced.htm'
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")
# Find the comment node whose text contains the table's container markup
comment = soup.find(string=lambda text: isinstance(text, Comment) and 'class="table_outer_container"' in text)
# Parse the comment's content as a separate HTML document
stats_page = BeautifulSoup(comment, "lxml")
column_headers = stats_page.findAll('tr')[0]
column_headers = [i.getText() for i in column_headers.findAll('th')]
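To see the effect without hitting the site, here is a minimal offline sketch of the same idea. The SAMPLE_HTML string, its table contents, and the header_cells helper are made up for illustration and are not taken from the real page. It uses the built-in "html.parser" so that lxml is not required, and it assumes the commented-out markup contains class="table_outer_container", as in the answer above.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical miniature page: the table exists only inside an HTML comment.
SAMPLE_HTML = """
<html><body>
<div>
<!--
<div class="table_outer_container">
<table>
<tr><th>Player</th><th>Tm</th><th>Int</th></tr>
<tr><td>Example Player</td><td>XYZ</td><td>3</td></tr>
</table>
</div>
-->
</div>
</body></html>
"""


def header_cells(page_html):
    """Return the th texts from the first row of a commented-out table."""
    soup = BeautifulSoup(page_html, "html.parser")
    # Comments are parsed as Comment text nodes, so match on the node type
    # and on a marker string that appears inside the comment.
    comment = soup.find(
        string=lambda s: isinstance(s, Comment)
        and 'class="table_outer_container"' in s
    )
    if comment is None:
        return []  # no matching comment on this page
    # Parse the comment's content as its own HTML document.
    inner = BeautifulSoup(comment, "html.parser")
    first_row = inner.find("tr")
    if first_row is None:
        return []
    return [th.get_text() for th in first_row.find_all("th")]


# Searching the outer document finds no rows, which is why [0] fails.
outer_rows = BeautifulSoup(SAMPLE_HTML, "html.parser").find_all("tr")
print(len(outer_rows))            # 0
print(header_cells(SAMPLE_HTML))  # ['Player', 'Tm', 'Int']
```

Returning an empty list when the comment or row is missing avoids the IndexError from the question if the page layout changes.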