Extract HTML into JSON with Python BeautifulSoup

Question

The problem

I’m trying to parse some blocks of HTML and store the relevant data in a JSON object, but I’m struggling because BeautifulSoup’s treatment of child tags clashes with my specific requirements.

Example input:

Desired output:

My attempt

Here’s my best attempt so far:

Which p…

Accepted Answer

It’s not a BeautifulSoup solution, but it may be easier to use an event-based parser such as lxml.etree.iterparse() instead. You can register for start/end (open tag/close tag) events, which is a useful way of handling the parent/child nesting.

    import io, json, lxml.etree

    def process(html):
        # convert html str into fileobj for iterparse
        html = io.BytesIO(html.encode('utf-8'))
        parser = lxml.etree.iterparse(
            html, events=('start', 'end'), html=True)

        root = None
        parents = []

        for event, tag in parser:
            if event == 'start':
                content = []
                if tag.text and tag.text.strip():
                    content.append(tag.text.strip())
                child = dict(type=tag.tag, content=content)
                parents.append(child)
                if not root:
                    root = child
            else:
                # close - point child to parent
                if len(parents) > 1:
                    parent, child = parents[-2:]
                    parent['content'].append(child)
                child = parents.pop()
                content = child['content']
                # unwrap 1 element lists that contain a text only node
                if len(content) == 1 and isinstance(content[0], str):
                    child['content'] = content.pop()
                    # If the previous element is also a text only node
                    # join text together and "discard" the "dict"
                    if len(parent['content']) > 1 and isinstance(parent['content'][-2], str):
                        parent['content'][-2] += ' ' + child['content']
                        parent['content'].pop()

        #root = root['content'][0]['content']
        print(json.dumps(root, indent=4))

With html=True, iterparse adds the enclosing html and body tags; you can use root = root['content'][0]['content'] or similar if you want to exclude them.

Output:

    {
        "type": "html",
        "content": [
            {
                "type": "body",
                "content": [
                    {
                        "type": "p",
                        "content": "Here's a paragraph"
                    },
                    {
                        "type": "ul",
                        "content": [
                            {
                                "type": "li",
                                "content": "With a list"
                            },
                            {
                                "type": "li",
                                "content": [
                                    {
                                        "type": "ul",
                                        "content": [
                                            {
                                                "type": "li",
                                                "content": "And a nested list"
                                            },
                                            {
                                                "type": "li",
                                                "content": "Within it that has some bold text"
                                            }
                                        ]
                                    }
                                ]
                            }
                        ]
                    }
                ]
            }
        ]
    }
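If lxml is not available, the same start/end event idea can be sketched with the standard library's html.parser.HTMLParser. This is a swapped-in parser, not the answer's code: the class name TreeBuilder, the function html_to_tree and the SAMPLE markup are hypothetical, and the sample input is only a guess reconstructed from the output shown in the answer, since the question's original input is not available. The sketch assumes well-formed markup with no void elements such as `<br>`.

```python
import json
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Build nested {"type", "content"} dicts from start/end tag events.

    Hypothetical helper; assumes every start tag has a matching end tag.
    """

    def __init__(self):
        super().__init__()
        # Synthetic root so that top-level nodes always have a parent.
        self.root = {"type": "root", "content": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Open tag: attach a new node to the current parent and descend.
        node = {"type": tag, "content": []}
        self.stack[-1]["content"].append(node)
        self.stack.append(node)

    def handle_data(self, data):
        # Text between tags: keep it unless it is only whitespace.
        text = data.strip()
        if text:
            self.stack[-1]["content"].append(text)

    def handle_endtag(self, tag):
        if len(self.stack) == 1:
            return  # stray end tag with nothing open; ignore it
        node = self.stack.pop()
        content = node["content"]
        # Unwrap single-element lists that hold only a text node.
        if len(content) == 1 and isinstance(content[0], str):
            node["content"] = content[0]
            siblings = self.stack[-1]["content"]
            # If the previous sibling is plain text, join the two and
            # discard the dict (this drops inline tags such as <b>).
            if len(siblings) > 1 and isinstance(siblings[-2], str):
                siblings[-2] += " " + node["content"]
                siblings.pop()


def html_to_tree(html):
    """Parse an HTML fragment and return its list of top-level nodes."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root["content"]


# Hypothetical sample input, reconstructed to resemble the answer's output.
SAMPLE = (
    "<p>Here's a paragraph</p>"
    "<ul><li>With a list</li>"
    "<li><ul><li>And a nested list</li>"
    "<li>Within it that has some <b>bold text</b></li></ul></li></ul>"
)

tree = html_to_tree(SAMPLE)
print(json.dumps(tree, indent=4))
```

Unlike iterparse with html=True, HTMLParser does not add html and body wrappers, so the result is already the list of top-level nodes. As in the answer, text-only child elements that follow plain text are merged into that text, so the bold tag itself is not preserved.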



Answer