
Can’t stratify output based on different headings and their corresponding paragraphs

I’m trying to fetch each heading and its corresponding paragraphs from the HTML below. Each heading and its paragraphs should be stored in a dictionary. Everything I’ve tried so far produces haphazard output, which I have left out to save space.

html = """
<h1>a Complexity Profile</h1>
<p>Since time immemorial humans have...</p>
<p>How often have we been told</p>

<h2>INDIVIDUAL AND COLLECTIVE BEHAVIOR</h2>
<p>Building a model of society based...</p>
<p>All macroscopic systems...</p>

<h3>COMPLEXITY PROFILE</h3>
<p>It is much easier to think about the...</p>
<p>A formal definition of scale considers...</p>
<p>The complexity profile counts...</p>

<h2>CONTROL IN HUMAN ORGANIZATIONS</h2>
<p>Using this argument it is straightforward...</p>

<h2>Conclusion</h2>
<p>There are two natural conclusions...</p>
"""

I’ve tried the following, which produces messy output:

import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html,"lxml")
data = []
for item in soup.select("h1,h2,h3,h4,h5,h6"):
    d = {}
    d['title'] = item.text
    d['text'] = [i.text for i in item.find_next_siblings('p')]
    data.append(d)

print(json.dumps(data,indent=4))

Output I wish to get:

[
    {
        "title": "a Complexity Profile",
        "text": [
            "Since time immemorial humans have...",
            "How often have we been told"
        ]
    },
    {
        "title": "INDIVIDUAL AND COLLECTIVE BEHAVIOR",
        "text": [
            "Building a model of society based...",
            "All macroscopic systems..."
        ]
    },
    {
        "title": "COMPLEXITY PROFILE",
        "text": [
            "It is much easier to think about the...",
            "A formal definition of scale considers...",
            "The complexity profile counts..."
        ]
    },
    {
        "title": "CONTROL IN HUMAN ORGANIZATIONS",
        "text": [
            "Using this argument it is straightforward..."
        ]
    },
    {
        "title": "Conclusion",
        "text": [
            "There are two natural conclusions..."
        ]
    }
]


Answer

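Tricky problem. find_next_siblings('p') returns every later sibling <p>, not just the ones up to the next heading, so each title also collects the paragraphs of all the sections after it. I think you have to handle things linearly, walking the headings and paragraphs in document order: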

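import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
data = []
pending = None  # section currently being filled; None until the first heading
# select() returns matches in document order, so headings and paragraphs interleave
for item in soup.select("h1,h2,h3,h4,h5,h6,p"):
    if item.name == 'p':
        # skip paragraphs that appear before any heading
        if pending is not None:
            pending['text'].append(item.text)
    else:
        # a new heading closes the previous section and opens a new one
        if pending is not None:
            data.append(pending)
        pending = {'title': item.text, 'text': []}
# flush the last section, if there was one
if pending is not None:
    data.append(pending)

print(json.dumps(data, indent=4))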

Output:

timr@tims-gram:~/src$ python x.py
[
    {
        "title": "a Complexity Profile",
        "text": [
            "Since time immemorial humans have...",
            "How often have we been told"
        ]
    },
    {
        "title": "INDIVIDUAL AND COLLECTIVE BEHAVIOR",
        "text": [
            "Building a model of society based...",
            "All macroscopic systems..."
        ]
    },
    {
        "title": "COMPLEXITY PROFILE",
        "text": [
            "It is much easier to think about the...",
            "A formal definition of scale considers...",
            "The complexity profile counts..."
        ]
    },
    {
        "title": "CONTROL IN HUMAN ORGANIZATIONS",
        "text": [
            "Using this argument it is straightforward..."
        ]
    },
    {
        "title": "Conclusion",
        "text": [
            "There are two natural conclusions..."
        ]
    }
]
timr@tims-gram:~/src$ 
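
If you would rather keep the sibling-based idea from the question, here is a rough sketch that walks the siblings of each heading and stops at the next heading. The function name sections_by_sibling_walk and the sample HTML are made up for illustration, and I used "html.parser" instead of "lxml" only so the snippet has no extra dependency. It assumes the headings and paragraphs are direct siblings, as in the HTML above; if paragraphs are nested inside other tags, the linear approach is probably the safer choice.

```python
import json
from bs4 import BeautifulSoup

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

def sections_by_sibling_walk(html):
    """Pair each heading with the <p> siblings that follow it, up to the next heading."""
    soup = BeautifulSoup(html, "html.parser")
    data = []
    # find_all() with a list of tag names returns matches in document order
    for heading in soup.find_all(list(HEADINGS)):
        texts = []
        for sib in heading.find_next_siblings():
            if sib.name in HEADINGS:
                break  # the next heading starts a new section
            if sib.name == "p":
                texts.append(sib.get_text())
        data.append({"title": heading.get_text(), "text": texts})
    return data

# hypothetical sample input, not the HTML from the question
sample = """
<h1>Intro</h1>
<p>first</p>
<p>second</p>
<h2>Details</h2>
<p>third</p>
"""
result = sections_by_sibling_walk(sample)
print(json.dumps(result, indent=4))
```

The only real difference from the attempt in the question is the break: without it, every heading keeps collecting paragraphs all the way to the end of the document.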
User contributions licensed under: CC BY-SA