I'm trying to fetch each heading and its corresponding paragraphs from the HTML below. The results should be stored as a list of dictionaries, one per heading. Everything I've tried so far produces haphazard output; I've left the current output out for brevity.
html = """
<h1>a Complexity Profile</h1>
<p>Since time immemorial humans have</p>
<p>How often have we been told</p>

<h2>INDIVIDUAL AND COLLECTIVE BEHAVIOR</h2>
<p>Building a model of society based</p>
<p>All macroscopic systems</p>

<h3>COMPLEXITY PROFILE</h3>
<p>It is much easier to think about the</p>
<p>A formal definition of scale considers</p>
<p>The complexity profile counts</p>

<h2>CONTROL IN HUMAN ORGANIZATIONS</h2>
<p>Using this argument it is straightforward</p>

<h2>Conclusion</h2>
<p>There are two natural conclusions</p>
"""
Here is what I've tried, which produces messy output:
import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html,"lxml")
data = []
for item in soup.select("h1,h2,h3,h4,h5,h6"):
    d = {}
    d['title'] = item.text
    d['text'] = [i.text for i in item.find_next_siblings('p')]
    data.append(d)

print(json.dumps(data,indent=4))
Output I wish to get:
[
    {
        "title": "a Complexity Profile",
        "text": [
            "Since time immemorial humans have...",
            "How often have we been told"
        ]
    },
    {
        "title": "INDIVIDUAL AND COLLECTIVE BEHAVIOR",
        "text": [
            "Building a model of society based...",
            "All macroscopic systems..."
        ]
    },
    {
        "title": "COMPLEXITY PROFILE",
        "text": [
            "It is much easier to think about the...",
            "A formal definition of scale considers...",
            "The complexity profile counts..."
        ]
    },
    {
        "title": "CONTROL IN HUMAN ORGANIZATIONS",
        "text": [
            "Using this argument it is straightforward..."
        ]
    },
    {
        "title": "Conclusion",
        "text": [
            "There are two natural conclusions..."
        ]
    }
]
Answer
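Tricky problem. `find_next_siblings('p')` returns every later `<p>` sibling, not just the ones before the next heading, so each heading collects paragraphs that belong to later sections. I think you have to handle things linearly: walk the headings and paragraphs in document order and attach each paragraph to the most recent heading: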
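import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
data = []
pending = {}  # section being filled; stays empty until the first heading
# select() yields matches in document order, so headings and paragraphs interleave
for item in soup.select("h1,h2,h3,h4,h5,h6,p"):
    if item.name == 'p':
        # ignore paragraphs that appear before any heading
        if pending:
            pending['text'].append(item.text)
    else:
        # a new heading closes the previous section
        if pending:
            data.append(pending)
        pending = {'title': item.text, 'text': []}
# flush the last section, if there was one
if pending:
    data.append(pending)

print(json.dumps(data, indent=4))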
Output:
timr@tims-gram:~/src$ python x.py
[
    {
        "title": "a Complexity Profile",
        "text": [
            "Since time immemorial humans have...",
            "How often have we been told"
        ]
    },
    {
        "title": "INDIVIDUAL AND COLLECTIVE BEHAVIOR",
        "text": [
            "Building a model of society based...",
            "All macroscopic systems..."
        ]
    },
    {
        "title": "COMPLEXITY PROFILE",
        "text": [
            "It is much easier to think about the...",
            "A formal definition of scale considers...",
            "The complexity profile counts..."
        ]
    },
    {
        "title": "CONTROL IN HUMAN ORGANIZATIONS",
        "text": [
            "Using this argument it is straightforward..."
        ]
    },
    {
        "title": "Conclusion",
        "text": [
            "There are two natural conclusions..."
        ]
    }
]
timr@tims-gram:~/src$
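For what it's worth, the same linear idea does not strictly need bs4. Below is a minimal sketch using only the standard library's `html.parser`, assuming headings and paragraphs are not nested inside each other. The names `SectionParser`, `group_by_heading`, and `sample` are made up for illustration, and the sample markup is a small stand-in rather than the HTML from the question.

```python
import json
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionParser(HTMLParser):
    """Hypothetical helper: group <p> text under the most recent heading."""

    def __init__(self):
        super().__init__()
        self.sections = []  # one {'title': ..., 'text': [...]} per heading
        self._tag = None    # the heading or <p> tag we are currently inside
        self._buf = []      # text chunks collected for that element

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS or tag == "p":
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # closing tag of some inline element; keep collecting
        text = "".join(self._buf).strip()
        if tag in HEADINGS:
            # A heading starts a new section.
            self.sections.append({"title": text, "text": []})
        elif self.sections:
            # A paragraph joins the latest section; paragraphs that
            # appear before any heading are dropped.
            self.sections[-1]["text"].append(text)
        self._tag = None


def group_by_heading(html):
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    return parser.sections


# Stand-in markup for illustration only.
sample = """
<p>orphan paragraph before any heading</p>
<h1>First</h1>
<p>one</p>
<p>two</p>
<h2>Second</h2>
<h2>Third</h2>
<p>three</p>
"""

print(json.dumps(group_by_heading(sample), indent=4))
```

The trade-off is that `html.parser` is less forgiving of broken markup than bs4 with lxml, so I'd only reach for this when the input is known to be clean or when adding a dependency is not an option.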