Some websites automatically decline requests that lack a User-Agent header, and it’s a hassle using bs4 to scrape many different types of tables.
This issue was resolved before through this code:
Python
import pandas as pd
import urllib2

url = 'http://finance.yahoo.com/quote/A/key-statistics?p=A'
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open(url)
tables = pd.read_html(response.read())
However, urllib2 has been deprecated, and urllib3 doesn’t have a build_opener() attribute; I could not find an equivalent attribute either, even though I’m sure it has one.
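For what it’s worth, urllib2’s build_opener() did not disappear entirely: it moved into urllib.request in Python 3, so the original approach can still be written almost unchanged. A minimal sketch (the actual fetch is left commented out so the example stays offline):

```python
import urllib.request  # Python 3 home of urllib2's build_opener()

url = 'http://finance.yahoo.com/quote/A/key-statistics?p=A'

# build_opener() and addheaders work the same as in urllib2
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

# The fetch itself would then look exactly like the urllib2 version:
# response = opener.open(url)
# tables = pd.read_html(response.read())
```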
Answer
read_html()
accepts both a URL and a string, so you can set the headers on the request yourself and let pandas read the response as text:
Python
import pandas as pd
import requests


url = 'http://finance.yahoo.com/quote/A/key-statistics?p=A'
response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
tables = pd.read_html(response.text)
print(tables)
If you look at the signature of read_html(),
none of its parameters accept headers, so just set the headers on the request instead.
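As an offline sketch of the string path (the table HTML here is made up, not from Yahoo Finance; read_html() also needs an HTML parser such as lxml or html5lib installed, and newer pandas versions prefer literal HTML wrapped in StringIO):

```python
from io import StringIO

import pandas as pd

# Hypothetical table standing in for the fetched page
html = """
<table>
  <tr><th>Metric</th><th>Value</th></tr>
  <tr><td>Beta</td><td>1.05</td></tr>
</table>
"""

# Wrapping the literal HTML in StringIO avoids the FutureWarning
# that newer pandas versions emit for plain strings
tables = pd.read_html(StringIO(html))
print(tables[0])
```

The call returns a list of DataFrames, one per <table> element found, which is why the answer indexes into the result.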