I’m trying to extract links with BeautifulSoup from the SEC website, such as this one, using the code from this GitHub repository. The thing is, I do not want to extract every 8-K, only the ones matching item “2.02” in the “Description” column. So I edited the “Download.py” file and identified the following:
    while continuation_tag:
        r = requests_get(browse_url, params=requests_params)
        if continuation_tag == 'first pass':
            logger.debug("EDGAR search URL: " + r.url)
            logger.info('-' * 100)
        data = r.text
        soup = BeautifulSoup(data, "html.parser")
        for link in soup.find_all('a', {'id': 'documentsbutton'}):
            URL = sec_website + link['href']
            linkList.append(URL)
        # a button labelled 'Next 100' for example
        continuation_tag = soup.find('input', {'value': 'Next ' + str(count)})
        if continuation_tag:
            continuation_string = continuation_tag['onclick']
            browse_url = sec_website + re.findall(r'cgi-bin.*count=\d*', continuation_string)[0]
            requests_params = None
    return linkList
I’ve tried to add another loop to match my regex, but it doesn’t work:
    for link in soup.find_all('a', {'id': 'documentsbutton'}):
        for link in soup.find_all(string=re.compile("items 2.02")):
            URL = sec_website + link['href']
            linkList.append(URL)
Any help would be really appreciated, thanks!
Answer
First find the <tr> that encapsulates both the <a> tag and the <td> tag that contains the "items 2.02" text. Then take the URL from that <tr> only if the <td> actually contains the text "items 2.02":
    for link in soup.find_all("tr"):
        td = link.find('td', {'class': 'small'})
        if td:
            if 'items 2.02' in td.text:
                URL = sec_website + link.find('a', {'id': 'documentsbutton'})['href']
                linkList.append(URL)
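
For reference, the attempt in the question fails because find_all(string=...) returns text nodes (NavigableString objects), which have no href attribute to look up. The row-based approach can be tried outside Download.py with a minimal sketch like the one below. The sample HTML, the filter_links helper, the ITEM_PATTERN regex, and the https://www.sec.gov base URL are all assumptions made for illustration; the real EDGAR markup may differ. The sketch also skips rows that have no documents button, which the snippet above does not guard against.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample that imitates an EDGAR results table (not real SEC markup).
SAMPLE_HTML = """
<table>
  <tr>
    <td>8-K</td>
    <td><a id="documentsbutton" href="/Archives/edgar/data/1/a-index.htm">Documents</a></td>
    <td class="small">Current report, items 2.02 and 9.01</td>
  </tr>
  <tr>
    <td>8-K</td>
    <td><a id="documentsbutton" href="/Archives/edgar/data/1/b-index.htm">Documents</a></td>
    <td class="small">Current report, items 5.02 and 9.01</td>
  </tr>
  <tr>
    <td>8-K</td>
    <td class="small">Current report, items 2.02</td>
  </tr>
</table>
"""

# Assumed pattern: match "2.02" as a whole item number, so "12.02" or "2.021" do not match.
ITEM_PATTERN = re.compile(r"\b2\.02\b")


def filter_links(html, base_url="https://www.sec.gov", pattern=ITEM_PATTERN):
    """Return document URLs only for rows whose description cell matches pattern."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for row in soup.find_all("tr"):
        td = row.find("td", {"class": "small"})
        a = row.find("a", {"id": "documentsbutton"})
        # Skip rows that lack either the description cell or the documents link.
        if td is None or a is None:
            continue
        if pattern.search(td.get_text()):
            links.append(base_url + a["href"])
    return links


links = filter_links(SAMPLE_HTML)
print(links)
```

With this sample, only the first row is kept: the second row lists different items, and the third has no link to follow. Inside Download.py the same check would replace the existing loop over the 'documentsbutton' links, with sec_website in place of base_url.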