I’m using Beautiful Soup to put in a excel table some infos from a website.
The bold titles are shown in the head columns while the text after the colon appear in the rows.
What I’m doing is finding the text and searching for next_sibling –>
book_year = sibling.pre.find('b',text='Anno:').next_sibling.get_text().strip()
The problem is that in some cases the text after colon, is split in different #text part. So if I use the next_sibling, it’ll get only a partial info.
As you can see in the inspector, the content of Titoli originali: will only be “da” if I use next_sibling.
Is there a way to unify all those #text parts? How would you approach this problem? Thank you
UPDATES:
This is the website I’m scraping from –> http://www.letteraturenordiche.it/danimarca.htm
It’s giving me a hard time cause it has an incoherent structure and no use of classes.
One thing I did is to remove from the <pre>
content all of the tags, <font>
tags and <span>
tags, to leave only the <b>
ones and take the text after that.
Advertisement
Answer
Parsing this document isn’t pretty. Probable the document is hand-written in Word and then exported to HTML:
import requests import pandas as pd from bs4 import BeautifulSoup url = "http://www.letteraturenordiche.it/danimarca.htm" soup = BeautifulSoup(requests.get(url).content, "html.parser") # preprocess the document: # remove all whitespaces: for w in soup.find_all(text=True): if not w.strip(): w.extract() # unwrap not necessary tags: for t in soup.select("i, font, span"): t.unwrap() # merge NavigableStrings together: soup.smooth() data = [] for t in soup.select("table"): title = t.p.get_text(separator=" ", strip=True) year = ( t.select_one('b:-soup-contains("Anno:")') .find_next_sibling(text=True) .strip() ) author = ( t.find_previous("hr", attrs={"size": "6"}) .find_previous("p") .get_text(strip=True) ) editor = ( t.select_one('b:-soup-contains("Editore:")') .find_next_sibling(text=True) .strip() ) pages = ( t.select_one('b:-soup-contains("Pagine:")') .find_next_sibling(text=True) .strip() ) notes = ( t.select_one('b:-soup-contains("Note:", "Comprende")') .find_next_sibling(text=True) .strip() ) original_title = t.select_one( 'b:-soup-contains("Titolo Original", "Titolo original", "Titoli originali")' ) if not original_title: original_title = t.find(lambda t: t.text.strip() == ":") if not original_title: original_title = "" else: original_title = original_title.find_next_sibling(text=True).strip() data.append((title, year, author, editor, pages, notes, original_title)) df = pd.DataFrame( data, columns=[ "title", "year", "author", "editor", "pages", "notes", "original_title", ], ) df["title"] = df["title"].str.replace(r"r?n", " ", regex=True) df["author"] = df["author"].str.replace(r"r?n", " ", regex=True) print(df) df.to_csv("data.csv", index=False)
Creates the dataframe and saves it as data.csv
(screenshot from LibreOffice):