I’m trying to extract every link with BeautifulSoup from the SEC website, such as this one, using the code from this GitHub repository. The thing is, I do not want to extract every 8-K, only the ones matching item “2.02” in the “Description” column. So I edited the “Download.py” file and identified the following:
    while continuation_tag:
        r = requests_get(browse_url, params=requests_params)
        if continuation_tag == 'first pass':
            logger.debug("EDGAR search URL: " + r.url)
            logger.info('-' * 100)
        data = r.text
        soup = BeautifulSoup(data, "html.parser")
        for link in soup.find_all('a', {'id': 'documentsbutton'}):   
            URL = sec_website + link['href']
            linkList.append(URL)
        continuation_tag = soup.find('input', {'value': 'Next ' + str(count)}) # a button labelled 'Next 100' for example
        if continuation_tag:
            continuation_string = continuation_tag['onclick']
            browse_url = sec_website + re.findall(r'cgi-bin.*count=\d*', continuation_string)[0]
            requests_params = None
    return linkList
I’ve tried to add another loop to match my regex, but it doesn’t work:
for link in soup.find_all('a', {'id': 'documentsbutton'}):
    for link in soup.find_all(string=re.compile("items 2.02")):
        URL = sec_website + link['href']
        linkList.append(URL)
Any help would be really appreciated, thanks!
Answer
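Your nested loop fails because find_all(string=...) returns text nodes (NavigableString), which have no href attribute, and the inner loop also overwrites the outer link variable. Instead, first find the tr that contains both the a tag and the td tag holding the items 2.02 text. Then take the URL from that tr only if the td actually contains the text items 2.02: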
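for row in soup.find_all("tr"):
    # Description cell that lists the 8-K items
    td = row.find('td', {'class': 'small'})
    # "Documents" button in the same row (missing in header rows)
    a = row.find('a', {'id': 'documentsbutton'})
    if td and a and 'items 2.02' in td.text:
        URL = sec_website + a['href']
        linkList.append(URL)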
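If you want to try the row-based filter without hitting EDGAR, here is a minimal, self-contained sketch. The sample HTML, the filter_links helper, the SEC_WEBSITE value and the ITEM_PATTERN regex are all assumptions made for illustration; the real page markup may differ. The regex is a possible variation on the plain substring test: it escapes the dot and also accepts descriptions where 2.02 is not the first item listed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup that mimics an EDGAR results table (assumption, not the real page).
SAMPLE_HTML = """
<table>
  <tr>
    <td>8-K</td>
    <td><a id="documentsbutton" href="/Archives/edgar/data/1/a-index.htm">Documents</a></td>
    <td class="small">Current report, items 2.02 and 9.01</td>
  </tr>
  <tr>
    <td>8-K</td>
    <td><a id="documentsbutton" href="/Archives/edgar/data/1/b-index.htm">Documents</a></td>
    <td class="small">Current report, items 5.02 and 9.01</td>
  </tr>
  <tr>
    <td>8-K</td>
    <td class="small">Current report, items 2.02 (row without a documents link)</td>
  </tr>
</table>
"""

SEC_WEBSITE = "https://www.sec.gov"  # assumed value of sec_website

# "item" or "items", followed later by a standalone "2.02" (dot escaped on purpose).
ITEM_PATTERN = re.compile(r"items?\b.*?\b2\.02\b", re.IGNORECASE | re.DOTALL)


def filter_links(html, pattern=ITEM_PATTERN):
    """Return document links from rows whose description matches pattern."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for row in soup.find_all("tr"):
        description = row.find("td", {"class": "small"})
        button = row.find("a", {"id": "documentsbutton"})
        # Skip rows that lack either the description cell or the link.
        if description is None or button is None:
            continue
        if pattern.search(description.get_text()):
            links.append(SEC_WEBSITE + button["href"])
    return links


print(filter_links(SAMPLE_HTML))
```

In the original loop, the same check would replace the for link in soup.find_all('a', ...) block, so pagination keeps working unchanged.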
