BeautifulSoup: unable to extract href with several conditions

I’m trying to extract links with BeautifulSoup from the SEC website, such as this one, using the code from this GitHub repository. However, I don’t want every 8-K, only the ones matching “items 2.02” in the “Description” column. So I edited the “Download.py” file and identified the following:

    while continuation_tag:
        r = requests_get(browse_url, params=requests_params)
        if continuation_tag == 'first pass':
            logger.debug("EDGAR search URL: " + r.url)
            logger.info('-' * 100)
        data = r.text
        soup = BeautifulSoup(data, "html.parser")
        for link in soup.find_all('a', {'id': 'documentsbutton'}):   
            URL = sec_website + link['href']
            linkList.append(URL)
        continuation_tag = soup.find('input', {'value': 'Next ' + str(count)}) # a button labelled 'Next 100' for example
        if continuation_tag:
            continuation_string = continuation_tag['onclick']
            browse_url = sec_website + re.findall(r'cgi-bin.*count=\d*', continuation_string)[0]
            requests_params = None
    return linkList

I’ve tried to add another loop to match my regex, but it doesn’t work:

for link in soup.find_all('a', {'id': 'documentsbutton'}):
    for link in soup.find_all(string=re.compile("items 2.02")):
        URL = sec_website + link['href']
        linkList.append(URL)

Any help would be really appreciated, thanks!

Answer

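Your inner loop fails because find_all(string=...) returns text nodes, not tags, so link['href'] cannot work. Instead, first find each tr that contains both the a tag and the td tag holding the description. Then take the URL from that tr only if the td actually contains the text items 2.02: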

for row in soup.find_all("tr"):
    # The td with class "small" holds the description text
    td = row.find('td', {'class': 'small'})
    if td and 'items 2.02' in td.text:
        # Skip rows that have no "Documents" button
        a = row.find('a', {'id': 'documentsbutton'})
        if a:
            linkList.append(sec_website + a['href'])
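
To try the row-based filter without hitting EDGAR, here is a minimal self-contained sketch. The sample HTML is a hypothetical, simplified stand-in for the results table, and `extract_item_links`, `SAMPLE_HTML`, and the value of `SEC_WEBSITE` are my own assumptions rather than anything from the repository. The real page may be laid out differently, so treat it as an illustration only.

```python
from bs4 import BeautifulSoup

# Assumed value of sec_website; the original script may define it differently.
SEC_WEBSITE = "https://www.sec.gov"

# Hypothetical, simplified stand-in for an EDGAR filing-list page.
SAMPLE_HTML = """
<table>
  <tr>
    <td>8-K</td>
    <td><a href="/Archives/edgar/data/1/0001-index.htm" id="documentsbutton">Documents</a></td>
    <td class="small">Current report, items 2.02 and 9.01</td>
  </tr>
  <tr>
    <td>8-K</td>
    <td><a href="/Archives/edgar/data/1/0002-index.htm" id="documentsbutton">Documents</a></td>
    <td class="small">Current report, items 5.02 and 9.01</td>
  </tr>
  <tr>
    <td>8-K</td>
    <td class="small">Current report, items 2.02</td>
  </tr>
</table>
"""


def extract_item_links(html, item="items 2.02", base_url=SEC_WEBSITE):
    """Return the 'Documents' URLs of rows whose description mentions `item`."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for row in soup.find_all("tr"):
        # Check the description cell of this row only.
        description = row.find("td", {"class": "small"})
        if description is None or item not in description.get_text():
            continue
        # Look for the button inside the same row, and skip rows without one.
        anchor = row.find("a", {"id": "documentsbutton"})
        if anchor is not None and anchor.has_attr("href"):
            links.append(base_url + anchor["href"])
    return links


if __name__ == "__main__":
    print(extract_item_links(SAMPLE_HTML))
```

Searching inside each row is what ties the description to its own link. The third sample row matches the text but has no button, so it is skipped instead of raising an error.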
