Use Beautiful Soup to unify #text after a tag

I’m using Beautiful Soup to put some information from a website into an Excel table.

The bold labels (such as Anno:) become the column headers, while the text after each colon goes into the rows.

What I’m doing is finding the label and taking its next_sibling:

  book_year = sibling.pre.find('b',text='Anno:').next_sibling.get_text().strip()

The problem is that in some cases the text after the colon is split into several #text nodes, so next_sibling returns only part of the value.

[Screenshot: browser inspector showing the text after “Titoli originali:” split into several #text nodes]

As you can see in the inspector, the content of Titoli originali: will only be “da” if I use next_sibling.
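
A minimal sketch that reproduces this, using made-up markup rather than the real page (the tag layout and the title text are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not copied from the real page: the value after the
# label is interrupted by an <i> tag, so it is split into several nodes.
html = "<pre><b>Titoli originali:</b> da <i>Eventyr</i> e altri\n</pre>"
soup = BeautifulSoup(html, "html.parser")

label = soup.find("b", string="Titoli originali:")

# next_sibling is only the first text node, i.e. the text up to the <i> tag.
first_piece = label.next_sibling
print(repr(first_piece))
```

Here first_piece holds only the text before the <i> tag; the rest of the value sits in later siblings.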

Is there a way to unify all those #text parts? How would you approach this problem? Thank you

UPDATES:

This is the website I’m scraping from: http://www.letteraturenordiche.it/danimarca.htm

It’s giving me a hard time because it has an inconsistent structure and doesn’t use classes.

One thing I did is remove all of the <i>, <font> and <span> tags from the <pre> content, leaving only the <b> tags, and then take the text after each of those.


Answer

Parsing this document isn’t pretty; it was probably written by hand in Word and then exported to HTML. One approach is to unwrap the formatting tags and merge the text nodes before extracting the fields:

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = "http://www.letteraturenordiche.it/danimarca.htm"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Preprocess the document.

# Remove whitespace-only text nodes:
for w in soup.find_all(text=True):
    if not w.strip():
        w.extract()

# Unwrap formatting-only tags, keeping their text in place:
for t in soup.select("i, font, span"):
    t.unwrap()

# Merge adjacent NavigableStrings so each value becomes a single text node:
soup.smooth()

data = []
for t in soup.select("table"):
    title = t.p.get_text(separator=" ", strip=True)
    year = (
        t.select_one('b:-soup-contains("Anno:")')
        .find_next_sibling(text=True)
        .strip()
    )
    author = (
        t.find_previous("hr", attrs={"size": "6"})
        .find_previous("p")
        .get_text(strip=True)
    )
    editor = (
        t.select_one('b:-soup-contains("Editore:")')
        .find_next_sibling(text=True)
        .strip()
    )
    pages = (
        t.select_one('b:-soup-contains("Pagine:")')
        .find_next_sibling(text=True)
        .strip()
    )
    notes = (
        t.select_one('b:-soup-contains("Note:", "Comprende")')
        .find_next_sibling(text=True)
        .strip()
    )
    original_title = t.select_one(
        'b:-soup-contains("Titolo Original", "Titolo original", "Titoli originali")'
    )

    # Fallback: look for a tag whose only text is ":".
    if not original_title:
        original_title = t.find(lambda tag: tag.text.strip() == ":")

    if not original_title:
        original_title = ""
    else:
        original_title = original_title.find_next_sibling(text=True).strip()

    data.append((title, year, author, editor, pages, notes, original_title))

df = pd.DataFrame(
    data,
    columns=[
        "title",
        "year",
        "author",
        "editor",
        "pages",
        "notes",
        "original_title",
    ],
)
# Collapse line breaks inside titles and author names into single spaces:
df["title"] = df["title"].str.replace(r"\r?\n", " ", regex=True)
df["author"] = df["author"].str.replace(r"\r?\n", " ", regex=True)
print(df)
df.to_csv("data.csv", index=False)

This creates the dataframe and saves it as data.csv.

[Screenshot: data.csv opened in LibreOffice]
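
To see the unwrap-and-smooth step in isolation, here is a small offline sketch. The markup is made up for illustration (it is not copied from the real page), and value_after is a hypothetical helper, not part of Beautiful Soup; it also returns a default when a label is missing instead of raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second value is broken up by <i> and <font> tags.
html = (
    "<pre><b>Anno:</b> 1994\n"
    "<b>Titoli originali:</b> da <i>Eventyr</i> e <font size='2'>altri</font>\n"
    "</pre>"
)
soup = BeautifulSoup(html, "html.parser")

# Unwrap formatting-only tags; their text stays behind as separate text nodes.
for t in soup.select("i, font, span"):
    t.unwrap()

# Merge adjacent text nodes so each value is a single NavigableString.
soup.smooth()


def value_after(root, label, default=""):
    """Return the text node that follows the first <b> containing `label`."""
    for b in root.find_all("b"):
        if label in b.get_text():
            text = b.find_next_sibling(string=True)
            return text.strip() if text else default
    return default


print(value_after(soup, "Titoli originali:"))
print(value_after(soup, "Anno:"))
print(repr(value_after(soup, "Pagine:")))  # label not present in this snippet
```

After smooth(), the pieces that used to be separate #text nodes are one string, so a single sibling lookup returns the whole value.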

User contributions licensed under: CC BY-SA