BeautifulSoup sibling structure with br tags

I’m trying to parse an HTML document using the BeautifulSoup Python library, but the structure is getting distorted by <br> tags. Let me just give you an example.

Input HTML:


HTML that BeautifulSoup interprets:


In the source, the spans could be considered siblings. After parsing (using the default parser), the spans are no longer siblings, because the br tags have become part of the structure.

The only solution I can think of is to strip the <br> tags altogether before pouring the HTML into BeautifulSoup, but that doesn’t seem very elegant, as it requires me to change the input. What’s a better way to solve this?

Answer

Your best bet is to extract() the line breaks. It’s easier than you think :).
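
Here is a minimal sketch of that idea, assuming the current `bs4` package and the built-in `html.parser`. The sample markup is a hypothetical stand-in, not the asker's original input. It removes every `<br>` from the parsed tree and then checks that the two spans are siblings again:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption, not the original input)
html = """<div>
some text <br>
<span>some more text</span> <br>
<span>and more text</span>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list, so it is safe to modify the tree while looping.
# extract() detaches each <br> from the tree and returns the removed tag.
for br in soup.find_all("br"):
    br.extract()

first, second = soup.find_all("span")

# With the <br> tags gone, the spans share the same parent <div>
print(first.parent is second.parent)
print(first.find_next_sibling("span") is second)
```

This leaves the input string untouched and only changes the parsed tree. If you never need the removed tags again, `decompose()` should work in place of `extract()`.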

User contributions licensed under: CC BY-SA