I’m trying to extract every link with BeautifulSoup from the SEC website, such as this one, using the code from this GitHub repository. The thing is, I do not want to extract every 8-K, only the ones whose “Description” column mentions items “2.02”. So I edited the “Download.py” file and identified the following:
    while continuation_tag:
        r = requests_get(browse_url, params=requests_params)
        if continuation_tag == 'first pass':
            logger.debug("EDGAR search URL: " + r.url)
            logger.info('-' * 100)
        data = r.text
        soup = BeautifulSoup(data, "html.parser")
        for link in soup.find_all('a', {'id': 'documentsbutton'}):
            URL = sec_website + link['href']
            linkList.append(URL)
        continuation_tag = soup.find('input', {'value': 'Next ' + str(count)})  # a button labelled 'Next 100' for example
        if continuation_tag:
            continuation_string = continuation_tag['onclick']
            browse_url = sec_website + re.findall(r'cgi-bin.*count=\d*', continuation_string)[0]
            requests_params = None
    return linkList
I’ve tried to add another loop to match my regex, but it doesn’t work:
    for link in soup.find_all('a', {'id': 'documentsbutton'}):
        for link in soup.find_all(string=re.compile("items 2.02")):
            URL = sec_website + link['href']
            linkList.append(URL)
Any help would be really appreciated, thanks!
Answer
First find the tr that encapsulates both the a tag and the td tag that contains the items 2.02 text. Then find the URL in the tr if the td actually contains the text items 2.02:
    for link in soup.find_all("tr"):
        td = link.find('td', {'class': 'small'})
        if td:
            if 'items 2.02' in td.text:
                URL = sec_website + link.find('a', {'id': 'documentsbutton'})['href']
                linkList.append(URL)
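If you want to try the row-by-row filter outside of Download.py, here is a minimal, self-contained sketch of the same idea run against a made-up HTML snippet. The sample markup, the SEC_WEBSITE value and the filter_links helper are assumptions for illustration only, and the real EDGAR page may be laid out differently. It also skips rows that have a matching description but no documentsbutton link, which would otherwise raise a TypeError.

```python
from bs4 import BeautifulSoup

# Hypothetical sample imitating the EDGAR filing table; the real markup may differ.
SAMPLE_HTML = """
<table>
  <tr>
    <td>8-K</td>
    <td><a href="/Archives/edgar/data/1/0001-index.htm" id="documentsbutton">Documents</a></td>
    <td class="small">Current report, items 2.02 and 9.01</td>
  </tr>
  <tr>
    <td>8-K</td>
    <td><a href="/Archives/edgar/data/1/0002-index.htm" id="documentsbutton">Documents</a></td>
    <td class="small">Current report, items 5.02 and 9.01</td>
  </tr>
  <tr>
    <td>8-K</td>
    <td class="small">Current report, items 2.02</td>
  </tr>
</table>
"""

SEC_WEBSITE = "https://www.sec.gov"  # assumed value of sec_website


def filter_links(html, needle="items 2.02"):
    """Return document URLs for rows whose description cell contains `needle`."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for row in soup.find_all("tr"):
        # The description is assumed to live in <td class="small">.
        td = row.find("td", {"class": "small"})
        if td is None or needle not in td.get_text():
            continue
        # Guard against rows that match but carry no documents link.
        anchor = row.find("a", {"id": "documentsbutton"})
        if anchor is None or not anchor.get("href"):
            continue
        links.append(SEC_WEBSITE + anchor["href"])
    return links


print(filter_links(SAMPLE_HTML))
```

Only the first row is kept here: the second row lists other items, and the third has no link to follow. A plain substring test is used rather than a regex because the dot in "2.02" would match any character in a pattern unless it is escaped.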