
Extract HTML into JSON with Python BeautifulSoup

The problem

I’m trying to parse some blocks of HTML and store the relevant data in a JSON object, but BeautifulSoup’s treatment of child tags clashes with my requirements.

Example input:

JavaScript

Desired output:

JavaScript

My attempt

Here’s my best attempt so far:

JavaScript

Which produces the following output:

JavaScript

You can see I have three issues:

  1. The inner list appears twice
  2. The inner list is not nested within its parent list
  3. The text enclosed within the tags is lost

I know it’s a bit of a bizarre thing to do to HTML, but any suggestions on how to resolve these three points?


Answer

This isn’t a BeautifulSoup solution, but it may be easier to use an event-based parser instead, such as lxml.etree.iterparse().

You can register for start and end events (fired on each opening and closing tag), which is a convenient way to track parent/child nesting.
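
As a rough illustration of the start/end idea, here is a minimal sketch that swaps in the standard library's html.parser.HTMLParser for lxml so it runs without third-party packages. The node shape ({"tag": ..., "content": [...]}), the TreeBuilder class and the html_to_tree helper are all hypothetical names chosen for this example, and the sketch assumes well-formed markup in which every tag is explicitly closed.

```python
import json
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Build a nested dict tree from start/end tag events (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        # Assumed node shape: {"tag": name, "content": [child nodes or text]}
        self.root = {"tag": "root", "content": []}
        self.stack = [self.root]  # stack[-1] is the currently open element

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "content": []}
        self.stack[-1]["content"].append(node)  # nest under the open parent
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the current element; never pop the synthetic root.
        # Assumes tags are balanced (unclosed tags like <br> would break nesting).
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:  # keep text, skip whitespace-only runs
            self.stack[-1]["content"].append(text)


def html_to_tree(html):
    """Parse an HTML fragment and return the list of top-level nodes."""
    parser = TreeBuilder()
    parser.feed(html)
    parser.close()
    return parser.root["content"]


tree = html_to_tree("<ul><li>one<ul><li>inner</li></ul></li><li>two</li></ul>")
print(json.dumps(tree, indent=2))
```

Because each element is appended to whatever node is on top of the stack when its start event fires, the inner list is created once, inside its parent item, and text lands in the element that encloses it.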

JavaScript

iterparse wraps the result in <html><body> tags; if you want to exclude them, you can do something like root = root['content'][0]['content'].
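
The unwrapping step could be made a little more defensive than hard-coded indexing. The following is only a sketch: the 'content' key comes from the snippet above, while the "tag" key and the strip_wrappers helper are assumptions made for illustration.

```python
def strip_wrappers(node, wrappers=("html", "body")):
    """Descend through lone wrapper elements and return the inner children."""
    children = [node]
    # Keep descending while the only child is a wrapper element.
    while (
        len(children) == 1
        and isinstance(children[0], dict)
        and children[0].get("tag") in wrappers
    ):
        children = children[0]["content"]
    return children


# Hand-built tree in the assumed shape, mimicking the added <html><body> wrappers.
root = {
    "tag": "html",
    "content": [{"tag": "body", "content": [{"tag": "ul", "content": []}]}],
}
inner = strip_wrappers(root)
print(inner)
```

If the wrappers are absent, the function returns the original node in a one-element list instead of raising an IndexError or KeyError.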

output:

JavaScript
User contributions licensed under: CC BY-SA