I'm trying to extract each heading and its corresponding paragraphs from the HTML below. The results should be stored as a list of dictionaries, one per heading. Everything I've tried so far produces haphazard output, which I've left out for brevity.
html = """
<h1>a Complexity Profile</h1>
<p>Since time immemorial humans have...</p>
<p>How often have we been told</p>
<h2>INDIVIDUAL AND COLLECTIVE BEHAVIOR</h2>
<p>Building a model of society based...</p>
<p>All macroscopic systems...</p>
<h3>COMPLEXITY PROFILE</h3>
<p>It is much easier to think about the...</p>
<p>A formal definition of scale considers...</p>
<p>The complexity profile counts...</p>
<h2>CONTROL IN HUMAN ORGANIZATIONS</h2>
<p>Using this argument it is straightforward...</p>
<h2>Conclusion</h2>
<p>There are two natural conclusions...</p>
"""
Here is what I've tried, which produces messy output:
import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

data = []
for item in soup.select("h1,h2,h3,h4,h5,h6"):
    d = {}
    d['title'] = item.text
    d['text'] = [i.text for i in item.find_next_siblings('p')]
    data.append(d)

print(json.dumps(data, indent=4))
Output I wish to get:
[
    {
        "title": "a Complexity Profile",
        "text": [
            "Since time immemorial humans have...",
            "How often have we been told"
        ]
    },
    {
        "title": "INDIVIDUAL AND COLLECTIVE BEHAVIOR",
        "text": [
            "Building a model of society based...",
            "All macroscopic systems..."
        ]
    },
    {
        "title": "COMPLEXITY PROFILE",
        "text": [
            "It is much easier to think about the...",
            "A formal definition of scale considers...",
            "The complexity profile counts..."
        ]
    },
    {
        "title": "CONTROL IN HUMAN ORGANIZATIONS",
        "text": [
            "Using this argument it is straightforward..."
        ]
    },
    {
        "title": "Conclusion",
        "text": [
            "There are two natural conclusions..."
        ]
    }
]
Answer
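Tricky problem. The call find_next_siblings('p') returns every later sibling paragraph, not just the ones up to the next heading, so each heading also collects the paragraphs of the sections that follow it. I think you have to handle things linearly: walk the headings and paragraphs together in document order and start a new entry at each heading: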
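import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

data = []
pending = {}  # the section currently being filled
# Headings and paragraphs are visited together, in document order.
for item in soup.select("h1,h2,h3,h4,h5,h6,p"):
    if item.name == 'p':
        # Ignore a paragraph that appears before the first heading
        # (without this check it would raise a KeyError).
        if pending:
            pending['text'].append(item.text)
    else:
        # A new heading closes the previous section and opens a new one.
        if pending:
            data.append(pending)
        pending = {'title': item.text, 'text': []}

# Flush the last section, if there was one.
if pending:
    data.append(pending)

print(json.dumps(data, indent=4))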
Output:
timr@tims-gram:~/src$ python x.py
[
    {
        "title": "a Complexity Profile",
        "text": [
            "Since time immemorial humans have...",
            "How often have we been told"
        ]
    },
    {
        "title": "INDIVIDUAL AND COLLECTIVE BEHAVIOR",
        "text": [
            "Building a model of society based...",
            "All macroscopic systems..."
        ]
    },
    {
        "title": "COMPLEXITY PROFILE",
        "text": [
            "It is much easier to think about the...",
            "A formal definition of scale considers...",
            "The complexity profile counts..."
        ]
    },
    {
        "title": "CONTROL IN HUMAN ORGANIZATIONS",
        "text": [
            "Using this argument it is straightforward..."
        ]
    },
    {
        "title": "Conclusion",
        "text": [
            "There are two natural conclusions..."
        ]
    }
]
timr@tims-gram:~/src$
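If you want the same linear pass as a reusable function, here is a minimal sketch. It is a variation and not the only way to do it: the function name group_by_heading and the HEADINGS list are made up for illustration, it uses find_all() with a list of tag names instead of select(), and it swaps in the built-in "html.parser" so that lxml is not required. It also assumes that paragraphs appearing before the first heading should simply be skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical helper names, chosen for this sketch only.
HEADINGS = ["h1", "h2", "h3", "h4", "h5", "h6"]


def group_by_heading(html):
    """Return a list of {'title': ..., 'text': [...]} dicts, one per heading."""
    # "html.parser" ships with Python; the original code used "lxml".
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    # find_all() with a list of names yields matching tags in document order.
    for tag in soup.find_all(HEADINGS + ["p"]):
        if tag.name in HEADINGS:
            # A heading starts a new section.
            sections.append({"title": tag.get_text(strip=True), "text": []})
        elif sections:
            # A paragraph belongs to the most recent heading.
            sections[-1]["text"].append(tag.get_text(strip=True))
        # Assumption: paragraphs before the first heading are skipped.
    return sections


sample = "<p>intro</p><h1>A</h1><p>one</p><p>two</p><h2>B</h2><p>three</p>"
print(group_by_heading(sample))
```

Appending each section as soon as its heading is seen means there is no pending entry to flush at the end, and a heading with no paragraphs still shows up with an empty "text" list.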