I’m trying to extract links with BeautifulSoup from the SEC website, such as this one, using the code from this GitHub repository. The thing is, I do not want to extract every 8-K, only the ones matching the item “2.02” in the “Description” column. So I edited the “Download.py” file and identified the following:
while continuation_tag:
    r = requests_get(browse_url, params=requests_params)
    if continuation_tag == 'first pass':
        logger.debug("EDGAR search URL: " + r.url)
        logger.info('-' * 100)
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    for link in soup.find_all('a', {'id': 'documentsbutton'}):
        URL = sec_website + link['href']
        linkList.append(URL)
    continuation_tag = soup.find('input', {'value': 'Next ' + str(count)})  # a button labelled 'Next 100' for example
    if continuation_tag:
        continuation_string = continuation_tag['onclick']
        browse_url = sec_website + re.findall(r'cgi-bin.*count=\d*', continuation_string)[0]
        requests_params = None
return linkList
I’ve tried to add another loop to match my regex, but it doesn’t work:
for link in soup.find_all('a', {'id': 'documentsbutton'}):
    for link in soup.find_all(string=re.compile("items 2.02")):
        URL = sec_website + link['href']
        linkList.append(URL)
Any help would be really appreciated, thanks!
Answer
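The attempt fails because soup.find_all(string=...) returns the matching text nodes rather than tags, so link['href'] has nothing to look up. Instead, first find the tr that encloses both the a tag and the td tag containing the items 2.02 text. Then take the URL from that tr only if the td actually contains the text items 2.02: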
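for row in soup.find_all("tr"):
    # the td with class "small" holds the description text for this row
    td = row.find('td', {'class': 'small'})
    if td and 'items 2.02' in td.text:
        a = row.find('a', {'id': 'documentsbutton'})
        if a:  # skip rows that have no documents button
            URL = sec_website + a['href']
            linkList.append(URL)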
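To try the row-based filter without hitting the SEC site, here is a minimal, self-contained sketch. The HTML sample, the SEC_WEBSITE value, the filing_links helper, and the ITEM_PATTERN regex are all assumptions made for illustration; the real EDGAR markup may differ. Note that the regex matches 2.02 anywhere in the description cell, which is deliberately looser than the 'items 2.02' substring test above.

```python
import re
from bs4 import BeautifulSoup

SEC_WEBSITE = "https://www.sec.gov"  # assumed value of sec_website

# Hypothetical, simplified stand-in for the EDGAR filing table.
SAMPLE_HTML = """
<table>
  <tr>
    <td>8-K</td>
    <td><a href="/Archives/edgar/data/1/0001-index.htm" id="documentsbutton">Documents</a></td>
    <td class="small">Current report, items 2.02 and 9.01</td>
  </tr>
  <tr>
    <td>8-K</td>
    <td><a href="/Archives/edgar/data/1/0002-index.htm" id="documentsbutton">Documents</a></td>
    <td class="small">Current report, items 5.02 and 9.01</td>
  </tr>
  <tr>
    <td>8-K</td>
    <td><a href="/Archives/edgar/data/1/0003-index.htm" id="documentsbutton">Documents</a></td>
    <td class="small">Current report, items 1.01, 2.02 and 9.01</td>
  </tr>
</table>
"""

# Escape the dot so "2.02" is matched literally; \b keeps "12.02" from matching.
ITEM_PATTERN = re.compile(r"\b2\.02\b")


def filing_links(html, pattern=ITEM_PATTERN):
    """Return document links from rows whose description cell matches pattern."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for row in soup.find_all("tr"):
        cell = row.find("td", {"class": "small"})
        anchor = row.find("a", {"id": "documentsbutton"})
        # Keep the row only if both the description cell and the link exist.
        if cell is not None and anchor is not None and pattern.search(cell.get_text()):
            links.append(SEC_WEBSITE + anchor["href"])
    return links


links = filing_links(SAMPLE_HTML)
print(links)
```

With this sample, the first and third rows are kept and the second is skipped. Inside the pagination loop, the same row test would replace the original for loop over the documentsbutton links.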