
Python & Beautiful Soup – Extract text between a specific tag and class combination

I'm new to Beautiful Soup and to web scraping in general. I'm trying to build a dataframe holding the title, content, and publish date of each post on a blog-style website (everything is on one page: a title, a publish date, and then the post's content). I can get the title and publish date easily enough, but I can't pull the post's content correctly. Each post is structured like so:

JavaScript

Current Code:

JavaScript

I think I see the problem: each paragraph is being pulled as a separate news entry, but I haven't been able to correct that yet. How would I pull only the content that sits between each tag='h2' and class='thisYear' combination?


Answer

You can use, for example, tag.find_previous to find which block each paragraph belongs to:

JavaScript

Prints:

JavaScript
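
As a minimal sketch of the find_previous idea: the markup below is hypothetical (the real page is not shown) and only assumes that every post starts with an h2 of class thisYear followed by its p paragraphs. The names html, posts, and grouped are illustrative. Each paragraph looks backwards for the nearest matching heading and is collected under that heading's text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: assumes each post begins with <h2 class="thisYear">
# and that its content is the run of <p> tags that follows.
html = """
<h2 class="thisYear">First post</h2>
<p>First paragraph of post one.</p>
<p>Second paragraph of post one.</p>
<h2 class="thisYear">Second post</h2>
<p>Only paragraph of post two.</p>
"""

soup = BeautifulSoup(html, "html.parser")

posts = {}
for p in soup.find_all("p"):
    # Nearest preceding heading with the right tag and class combination.
    heading = p.find_previous("h2", class_="thisYear")
    if heading is None:
        continue  # paragraph appears before any post heading
    title = heading.get_text(strip=True)
    posts.setdefault(title, []).append(p.get_text(strip=True))

# Join each post's paragraphs into one content string.
grouped = {title: " ".join(paragraphs) for title, paragraphs in posts.items()}
print(grouped)
```

Keying on the heading text assumes titles are unique on the page; if they are not, the heading Tag object itself could probably serve as the key instead.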

EDIT: To transform out into a DataFrame, you can do:

JavaScript

Prints:

JavaScript
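
A hedged sketch of the DataFrame step, assuming hypothetical markup in which each post starts with an h2 of class thisYear followed by p paragraphs. The column names title and content are illustrative choices, not taken from the original answer. Paragraphs are grouped per heading first, then each group becomes one row.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only.
html = """
<h2 class="thisYear">First post</h2>
<p>First paragraph of post one.</p>
<p>Second paragraph of post one.</p>
<h2 class="thisYear">Second post</h2>
<p>Only paragraph of post two.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Group paragraph text under the nearest preceding h2.thisYear heading.
by_title = {}
for p in soup.find_all("p"):
    heading = p.find_previous("h2", class_="thisYear")
    if heading is not None:
        by_title.setdefault(heading.get_text(strip=True), []).append(
            p.get_text(strip=True)
        )

# One row per post, with the paragraphs joined into a single string.
df = pd.DataFrame(
    [{"title": t, "content": " ".join(ps)} for t, ps in by_title.items()]
)
print(df)
```

A publish-date column could be added the same way once the element that holds the date is known.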
User contributions licensed under: CC BY-SA