I am currently trying to get only the HTML text (a list of names) that is between the first two occurrences of the strong tag.
Here is a short example of the HTML I scraped:
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
.... .... ....
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
.... .... .... .... ....
Here is some quick code that I wrote, with the basic logic of counting the number of strong tags that occur. I know that after the second occurrence, all the names that I want have been collected:
import requests
from bs4 import BeautifulSoup as BS

html = requests.get('https://www.somewebsite.com')
soup = BS(html.text, 'html.parser')

# Pull only the HTML from the article that I am interested in
notes = soup.find('div', attrs={'id': 'article'})

# Define a function that returns True if a string contains </strong>
def findstrong(i):
    return "</strong>" in i

# Initialize a counter for strong; after the second strong I know all the
# names I am interested in have been collected
strong_counts = 0
list_of_names = []

for i in range(len(notes)):
    if strong_counts < 2:
        note = notes.contents[i]
        # Make note a string so we can use the findstrong function
        note_2_str = str(note)
        if not findstrong(note_2_str):
            list_of_names.append(note)
        else:
            strong_counts += 1
The loop works: it collects everything before the first strong tag and everything after it up until the next occurrence of the strong tag, i.e.:
<h3>Title of Article</h3>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
.... .... ....
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
This basically does what I want, but I lose some of the functionality of a BeautifulSoup object since it is now a list. Is there a BeautifulSoup function that can help me do this or another option? Or should I focus on making this loop more efficient before I scale it up to multiple articles?
Answer
To answer the question as asked, while leaving you the option of scraping the “Title of Article” and the footnotes: you can use findChildren() (an older alias of find_all()) and then decompose() to remove the unwanted elements. From the output of this code you can extract the data you need quite easily. It works even if the text “PRESENT” and “Section Header” are not present, and it can easily be adapted to remove the elements before the first strong tag if needed.
from bs4 import BeautifulSoup, element

html = """
<div><p> blah blah</p></div>
<div id="article">
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
<p> blah blah</p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Pull only the HTML from the article that I am interested in
notes = soup.find('div', attrs={'id': 'article'})

counter = 0
# Iterate over children.
for i in notes.findChildren():
    if i.name == "strong":
        counter += 1
        if counter == 2:
            i.parent.decompose()  # Remove the second Strong tag's parent.
    if counter > 1:  # Remove all tags after second Strong tag.
        if isinstance(i, element.Tag):
            i.decompose()

print(notes)
Outputs:
<div id="article">
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
</div>
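If you only need the names and would rather not modify the tree, here is a minimal sketch of one possible alternative. It locates the first two strong tags with find_all(), walks up to the direct children of the article div that contain them, and collects the tags in between via find_next_siblings(). The helper names (top_level_child, tags_between_first_two_strong) are made up for illustration, and cutting the list at "PRESENT:" is an assumption based on the sample HTML; without that marker you get every line of the section.

```python
from bs4 import BeautifulSoup

html = """
<div id="article">
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
<p> blah blah</p>
</div>
"""


def top_level_child(tag, container):
    """Hypothetical helper: walk up from `tag` to the direct child of `container`."""
    while tag.parent is not container:
        tag = tag.parent
    return tag


def tags_between_first_two_strong(container):
    """Hypothetical helper: direct child tags between the first two <strong> tags."""
    strongs = container.find_all("strong", limit=2)
    if len(strongs) < 2:
        return []
    start = top_level_child(strongs[0], container)
    stop = top_level_child(strongs[1], container)
    between = []
    # True matches every tag, so text nodes between siblings are skipped.
    for sibling in start.find_next_siblings(True):
        if sibling is stop:
            break
        between.append(sibling)
    return between


soup = BeautifulSoup(html, "html.parser")
article = soup.find("div", id="article")
section = tags_between_first_two_strong(article)

# Each <br/>-separated line becomes its own string.
lines = [text for tag in section for text in tag.stripped_strings]

# Assumption from the sample HTML: the names start right after "PRESENT:".
if "PRESENT:" in lines:
    lines = lines[lines.index("PRESENT:") + 1:]

print(lines)
```

The items in section are still ordinary Tag objects, so find(), get_text() and the rest of the BeautifulSoup API remain available on each of them, and the original article is left untouched.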