
Trying to get only the text between two strong tags

I am currently trying to get only the HTML text (a list of names) that is between the first two occurrences of the strong tag.

Here is a short example of the HTML I scraped


Here is some quick code I wrote; its basic logic is to count how many strong tags have occurred. I know that by the second occurrence all the names I want have been collected.
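As a rough illustration of the counting approach described above (not the original code, which is not shown here), a minimal sketch might look like the following. The HTML snippet, the `article` id, and the helper name `text_until_second_strong` are all hypothetical, and the sketch assumes the names sit as plain text directly inside one container element.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup standing in for the scraped page (assumption).
html = """
<div id="article">
Title of Article
<strong>PRESENT</strong>
<br/>Alice Smith
<br/>Bob Jones
<strong>Section Header</strong>
<br/>Some later text
</div>
"""

def text_until_second_strong(container):
    """Collect text from the container's direct children, stopping at the second <strong>."""
    strong_seen = 0
    collected = []
    for node in container.children:
        if isinstance(node, Tag) and node.name == "strong":
            strong_seen += 1
            if strong_seen == 2:
                break  # everything wanted has been collected
            continue  # skip the text of the first <strong> itself
        # Tags (e.g. <br/>) expose get_text(); bare strings are str subclasses.
        text = node.get_text(strip=True) if isinstance(node, Tag) else node.strip()
        if text:
            collected.append(text)
    return collected

soup = BeautifulSoup(html, "html.parser")
names = text_until_second_strong(soup.find("div", id="article"))
print(names)
```

As in the question, the result is a plain Python list, so the tree navigation of the BeautifulSoup object is no longer available on it.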


The loop works: it collects all the text before the first strong tag and everything after it up to the next occurrence of the strong tag, i.e.


This basically does what I want, but I lose some of the functionality of a BeautifulSoup object since it is now a list. Is there a BeautifulSoup function that can help me do this or another option? Or should I focus on making this loop more efficient before I scale it up to multiple articles?


Answer

To answer the question as asked, while still leaving you able to scrape the “Title of Article” and “Footnotes”: you can use findChildren() and then decompose() to remove the unwanted elements. The data you need can then be extracted quite easily from the output of this code. It works even if the text “PRESENT” and “Section Header” is absent, and it can easily be adapted to remove the elements before the first strong tag if needed.
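One way this idea could be sketched, assuming the strong tags are direct children of a single container, is shown below. The markup, the `article` and `footnotes` ids, and the helper name `trim_after_second_strong` are hypothetical, and this is not necessarily the exact code the answer used. It calls `findChildren()` (a legacy alias of `find_all()`) to locate the strong tags, then `decompose()` for tags and `extract()` for bare text nodes, so the result stays a navigable BeautifulSoup tree.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup standing in for the scraped page (assumption).
html = """
<div id="article">
Title of Article
<strong>PRESENT</strong>
<br/>Alice Smith
<br/>Bob Jones
<strong>Section Header</strong>
<p>Body text</p>
trailing text
</div>
<div id="footnotes">Footnotes</div>
"""

def trim_after_second_strong(container):
    """Remove the second direct-child <strong> and everything after it, in place."""
    strongs = container.findChildren("strong", recursive=False)
    if len(strongs) < 2:
        return container  # nothing to trim
    second = strongs[1]
    # Snapshot the siblings first, because removing nodes mutates the tree.
    for sibling in list(second.next_siblings):
        if isinstance(sibling, Tag):
            sibling.decompose()  # destroy the tag and its contents
        else:
            sibling.extract()    # text nodes are detached with extract()
    second.decompose()
    return container

soup = BeautifulSoup(html, "html.parser")
article = trim_after_second_strong(soup.find("div", id="article"))
remaining = list(article.stripped_strings)
print(remaining)
```

Because `article` is still a `Tag`, methods such as `find_all()` remain available on it, and the separate footnotes element is left untouched for later scraping.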


Outputs:

User contributions licensed under: CC BY-SA