I am trying to scrape MSFT’s income statement using code I found here: How to Web scraping SEC Edgar 10-K Dynamic data
That answer uses the `span` tag to narrow the search. I do not see a matching span in my filing, so I am trying to use the `<p>` tag instead, with no luck.
Here is my code; it is largely unchanged from the answer given. I changed `base_url` and changed `soup.find` to look for `'p'`. Is there a way to find the `<p>` element or, even better, a way to find the income statement table itself?
Here is the URL to the statement: https://www.sec.gov/Archives/edgar/data/789019/000156459018019062/msft-10k_20180630.htm
```python
from bs4 import BeautifulSoup
import requests

headers = {"User-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}

# Obtain HTML for search page
base_url = "https://www.sec.gov/Archives/edgar/data/789019/000156459018019062/msft-10k_20180630.htm"
edgar_resp = requests.get(base_url, headers=headers)
edgar_str = edgar_resp.text

soup = BeautifulSoup(edgar_str, 'html.parser')
s = soup.find('p', recursive=True, string='INCOME STATEMENTS ')
t = s.find_next('table')
trs = t.find_all('tr')
for tr in trs:
    if tr.text:
        print(list(tr.stripped_strings))
```
Here is the code from the example:
```python
from bs4 import BeautifulSoup
import requests

headers = {"User-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}

# Obtain HTML for search page
base_url = "https://www.sec.gov/Archives/edgar/data/200406/000020040621000057/jnj-20210704.htm"
edgar_resp = requests.get(base_url, headers=headers)
edgar_str = edgar_resp.text

soup = BeautifulSoup(edgar_str, 'html.parser')
s = soup.find('span', recursive=True, string='SALES BY SEGMENT OF BUSINESS ')
t = s.find_next('table')
trs = t.find_all('tr')
for tr in trs:
    if tr.text:
        print(list(tr.stripped_strings))
```
Thank you!
Answer
I’m not sure why that’s not working, but you can try this:
```python
s = soup.find('a', attrs={'name': 'INCOME_STATEMENTS'})
```
This should match the `<a name="INCOME_STATEMENTS"></a>` element inside that paragraph.
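As a self-contained sketch of how the pieces fit together (using a trimmed-down HTML fragment in place of the live filing, with made-up row values, so it runs without a network request), you would find the named anchor and then walk forward to the next table, just as in your original loop:

```python
from bs4 import BeautifulSoup

# Minimal fragment mimicking the filing's structure: a named <a> anchor
# inside the heading paragraph, followed by the income statement table.
# The numbers below are placeholders, not real filing data.
html = """
<p><a name="INCOME_STATEMENTS"></a><span>INCOME STATEMENTS</span></p>
<table>
  <tr><td>Revenue</td><td>110,360</td></tr>
  <tr><td>Net income</td><td>16,571</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the anchor by its name attribute, then the first table after it
anchor = soup.find("a", attrs={"name": "INCOME_STATEMENTS"})
table = anchor.find_next("table")

rows = [list(tr.stripped_strings) for tr in table.find_all("tr") if tr.text]
print(rows)
```

The same `find` call should work against the real filing URL, since the anchor's `name` attribute is a stable hook that does not depend on matching the heading's display text exactly.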