I'm trying to fetch each heading and its corresponding paragraphs from the HTML below and store the results as a list of dictionaries, one per heading. Everything I've tried so far produces haphazard output, which I've left out for brevity.
    html = """
    <h1>a Complexity Profile</h1>
    <p>Since time immemorial humans have...</p>
    <p>How often have we been told</p>
    <h2>INDIVIDUAL AND COLLECTIVE BEHAVIOR</h2>
    <p>Building a model of society based...</p>
    <p>All macroscopic systems...</p>
    <h3>COMPLEXITY PROFILE</h3>
    <p>It is much easier to think about the...</p>
    <p>A formal definition of scale considers...</p>
    <p>The complexity profile counts...</p>
    <h2>CONTROL IN HUMAN ORGANIZATIONS</h2>
    <p>Using this argument it is straightforward...</p>
    <h2>Conclusion</h2>
    <p>There are two natural conclusions...</p>
    """
Here is what I've tried (it produces messy output):
    import json
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html,"lxml")
    data = []
    for item in soup.select("h1,h2,h3,h4,h5,h6"):
        d = {}
        d['title'] = item.text
        d['text'] = [i.text for i in item.find_next_siblings('p')]
        data.append(d)

    print(json.dumps(data,indent=4))
Output I wish to get:
    [
        {
            "title": "a Complexity Profile",
            "text": [
                "Since time immemorial humans have...",
                "How often have we been told"
            ]
        },
        {
            "title": "INDIVIDUAL AND COLLECTIVE BEHAVIOR",
            "text": [
                "Building a model of society based...",
                "All macroscopic systems..."
            ]
        },
        {
            "title": "COMPLEXITY PROFILE",
            "text": [
                "It is much easier to think about the...",
                "A formal definition of scale considers...",
                "The complexity profile counts..."
            ]
        },
        {
            "title": "CONTROL IN HUMAN ORGANIZATIONS",
            "text": [
                "Using this argument it is straightforward..."
            ]
        },
        {
            "title": "Conclusion",
            "text": [
                "There are two natural conclusions..."
            ]
        }
    ]
Answer
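Tricky problem. `find_next_siblings('p')` returns every later `<p>` sibling, not just the ones before the next heading, so each title ends up collecting the paragraphs of all the sections that follow it. I think you have to handle things linearly: walk the headings and paragraphs together in document order and attach each paragraph to the most recent heading: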
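    import json
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "lxml")

    data = []
    pending = None  # section currently being filled
    # select() returns matches in document order, so each <p> follows its heading
    for item in soup.select("h1,h2,h3,h4,h5,h6,p"):
        if item.name == 'p':
            if pending is not None:  # ignore paragraphs before the first heading
                pending['text'].append(item.text)
        else:
            if pending is not None:
                data.append(pending)  # a new heading closes the previous section
            pending = {'title': item.text, 'text': []}
    if pending is not None:
        data.append(pending)  # flush the last section

    print(json.dumps(data, indent=4))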
Output:
    timr@tims-gram:~/src$ python x.py
    [
        {
            "title": "a Complexity Profile",
            "text": [
                "Since time immemorial humans have...",
                "How often have we been told"
            ]
        },
        {
            "title": "INDIVIDUAL AND COLLECTIVE BEHAVIOR",
            "text": [
                "Building a model of society based...",
                "All macroscopic systems..."
            ]
        },
        {
            "title": "COMPLEXITY PROFILE",
            "text": [
                "It is much easier to think about the...",
                "A formal definition of scale considers...",
                "The complexity profile counts..."
            ]
        },
        {
            "title": "CONTROL IN HUMAN ORGANIZATIONS",
            "text": [
                "Using this argument it is straightforward..."
            ]
        },
        {
            "title": "Conclusion",
            "text": [
                "There are two natural conclusions..."
            ]
        }
    ]
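If you'd rather not depend on bs4 and lxml, the same linear idea can probably be sketched with the standard library's `html.parser`. This is only a rough sketch, not a drop-in replacement: `SectionParser` and `group_sections` are names I made up for illustration, it strips surrounding whitespace from each text (unlike `item.text`), it silently skips paragraphs that appear before the first heading, and it assumes headings and paragraphs are not nested inside each other.

```python
import json
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionParser(HTMLParser):
    """Hypothetical helper: group <p> text under the most recent heading."""

    def __init__(self):
        super().__init__()
        self.sections = []  # result: [{'title': ..., 'text': [...]}, ...]
        self._tag = None    # heading or <p> tag currently open, if any
        self._buf = []      # text chunks collected for the open tag

    def handle_starttag(self, tag, attrs):
        # Only headings and paragraphs start a new capture; inline tags
        # such as <b> inside them are ignored but their text is kept.
        if tag in HEADINGS or tag == "p":
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buf).strip()
        if tag in HEADINGS:
            # A heading opens a new section.
            self.sections.append({"title": text, "text": []})
        elif self.sections:
            # A paragraph joins the latest section; paragraphs that come
            # before any heading are skipped.
            self.sections[-1]["text"].append(text)
        self._tag = None


def group_sections(html):
    """Return a list of {'title', 'text'} dicts in document order."""
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    return parser.sections


sample = "<h1>Intro</h1><p>one</p><p>two</p><h2>Next</h2><p>three</p>"
print(json.dumps(group_sections(sample), indent=4))
```

Calling `group_sections(html)` on the string from the question should give the same grouping as the bs4 version, since both just process headings and paragraphs in the order they appear.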