Trying to get only the text between two strong tags

I am currently trying to get only the HTML text (a list of names) that is between the first two occurrences of the strong tag.

Here is a short example of the HTML I scraped:

<h3>Title of Article</h3>

<p><strong>Section Header 1</strong></p>

<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>

<p>PRESENT:</p>

<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>

....
....
....

<p>William, Baller</p>

<p>Roy Williams, Coach</p>

<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
....
....
....
....
....


Here is some quick code that I wrote. The basic logic is to count the number of strong tags encountered: I know that once the second one occurs, all the names that I want have been collected.

import requests
from bs4 import BeautifulSoup as BS

html = requests.get('https://www.somewebsite.com')
soup = BS(html.text, 'html.parser')

#Pull only the HTML from the article that I am interested in 
notes = soup.find('div', attrs = {'id' : 'article'})


# Define a function that returns True if a string contains a closing </strong> tag
def findstrong(i):
    return "</strong>" in i


# Count the strong tags seen so far; after the second strong I know all the
# names I am interested in have been collected
strong_counts = 0

list_of_names = []
for note in notes.contents:
    # Stop as soon as the second strong tag has been reached
    if strong_counts >= 2:
        break

    # Convert the node to a string so we can use the findstrong function
    note_2_str = str(note)

    if not findstrong(note_2_str):
        list_of_names.append(note)
    else:
        strong_counts += 1

The loop works: it collects everything before the first strong tag and everything after it up to the next occurrence of the strong tag, i.e.:

<h3>Title of Article</h3>

<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>

<p>PRESENT:</p>

<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>

....
....
....

<p>William, Baller</p>

<p>Roy Williams, Coach</p>

This basically does what I want, but I lose some of the functionality of a BeautifulSoup object since it is now a list. Is there a BeautifulSoup function that can help me do this or another option? Or should I focus on making this loop more efficient before I scale it up to multiple articles?

Answer

To answer the question as asked, while leaving you the option of also scraping the “Title of Article” and the footnotes, you can use findChildren() (the older name for find_all()) and then decompose() to remove the unwanted elements. From the output of this code you can extract the data you need quite easily. It works even if the text “PRESENT” and “Section Header” are not present, and it can easily be adapted to also remove the elements before the first strong tag if needed.

from bs4 import BeautifulSoup, element

html = """
<div><p> blah blah</p></div>
<div id="article">
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
<p> blah blah</p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
# Pull only the HTML from the article that I am interested in
notes = soup.find('div', attrs = {'id' : 'article'})
counter = 0
# Iterate over children.
for i in notes.findChildren():
    if i.name == "strong":
        counter += 1
        if counter == 2:
            i.parent.decompose()  # Remove the second Strong tag's parent.
    if counter > 1:  # Remove all tags after second Strong tag.
        if isinstance(i, element.Tag):
            i.decompose()
print(notes)

Outputs:

<div id="article">
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
<p>William, Baller</p>
<p>Roy Williams, Coach</p>


</div>
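
If you would rather not modify the tree at all, here is a rough sketch of a non-destructive variant. It walks the direct children of the article div and gathers the text lines that sit between the first and second paragraphs containing a strong tag. The function name lines_between_strong and the shortened sample HTML are my own, made up for illustration, and it assumes each section header lives in its own top-level element, as in the example above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def lines_between_strong(container):
    """Collect text lines between the first and second <strong> sections.

    Hypothetical helper: assumes every section header sits in a direct
    child of `container` (for example <p><strong>...</strong></p>).
    """
    lines = []
    strong_seen = 0
    for child in container.children:
        if not isinstance(child, Tag):
            continue  # skip the whitespace text nodes between tags
        if child.name == "strong" or child.find("strong") is not None:
            strong_seen += 1
            if strong_seen == 2:
                break  # second header reached, nothing more to collect
            continue  # do not collect the header text itself
        if strong_seen == 1:
            # stripped_strings yields each text fragment, split at <br/>
            lines.extend(child.stripped_strings)
    return lines


# Shortened version of the sample HTML from the question.
html = """
<div id="article">
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum</p>
<p>Roy Williams, Coach</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph</p>
<p>blah blah</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
article = soup.find("div", attrs={"id": "article"})
lines = lines_between_strong(article)
print(lines)
```

Because nothing is decomposed, article is still a normal BeautifulSoup Tag afterwards, so you can keep calling find() and friends on it. The returned list also contains any other text in that section (here the “PRESENT:” line), so you may want to filter or slice it before treating the rest as names.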
User contributions licensed under: CC BY-SA