Python & Beautiful Soup – Extract text between a specific tag and class combination

Question

I'm new to Beautiful Soup and to web scraping in general. I'm trying to build a DataFrame containing the title, content, and publish date of each post on a blog-style website (everything is on one page: a title, a publish date, and then the post's content). I'm able to get the title and publish date easily enough, but I can't correctly

Accepted Answer

You can use, for example, tag.find_previous to find which block each paragraph belongs to:

    from bs4 import BeautifulSoup

    # Sample markup; the tag and class names are inferred from the
    # selectors used below (h2.thisYear, p.pubdate).
    html_doc = """
    <h2 class="thisYear">"First Post Title"</h2>
    <p class="pubdate">2022-07-11</p>
    <p>"First paragraph of post 1"</p>
    <p>"Second paragraph of post 1"</p>

    <h2 class="thisYear">"Second Post Title"</h2>
    <p class="pubdate">2022-07-07</p>
    <p>"First paragraph of post 2"</p>
    <p>"Second paragraph of post 2"</p>
    """

    soup = BeautifulSoup(html_doc, "html.parser")

    out = {}
    # Select every <p> that follows a post heading, skipping the date paragraphs.
    for p in soup.select("h2.thisYear ~ p:not(.pubdate)"):
        # The nearest preceding heading and date identify the post.
        title = p.find_previous("h2").text.strip()
        pubdate = p.find_previous(class_="pubdate").text.strip()
        out.setdefault((title, pubdate), []).append(p.text.strip())

    print(out)

Prints:

    {
        ('"First Post Title"', "2022-07-11"): [
            '"First paragraph of post 1"',
            '"Second paragraph of post 1"',
        ],
        ('"Second Post Title"', "2022-07-07"): [
            '"First paragraph of post 2"',
            '"Second paragraph of post 2"',
        ],
    }

EDIT: To turn out into a DataFrame, you can do:

    import pandas as pd

    df = pd.DataFrame(
        [
            (title, date, "\n".join(paragraphs))
            for (title, date), paragraphs in out.items()
        ],
        columns=["Title", "Date", "Paragraphs"],
    )
    print(df)

Prints:

                     Title        Date                                                Paragraphs
    0   "First Post Title"  2022-07-11  "First paragraph of post 1"\n"Second paragraph of post 1"
    1  "Second Post Title"  2022-07-07  "First paragraph of post 2"\n"Second paragraph of post 2"

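If it reads more naturally to start from each heading and walk forward, the same grouping can be sketched the other way round. This is only an illustration, not part of the accepted answer: the sample markup is an assumption inferred from the selectors above (h2.thisYear for titles, p.pubdate for dates), and posts_to_rows is a hypothetical helper name.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Assumed sample markup: the tag and class names are inferred from the
# selectors in the answer, not taken from the real page.
html_doc = """
<h2 class="thisYear">"First Post Title"</h2>
<p class="pubdate">2022-07-11</p>
<p>"First paragraph of post 1"</p>
<p>"Second paragraph of post 1"</p>
<h2 class="thisYear">"Second Post Title"</h2>
<p class="pubdate">2022-07-07</p>
<p>"First paragraph of post 2"</p>
<p>"Second paragraph of post 2"</p>
"""


def posts_to_rows(html):
    """Hypothetical helper: return one (title, date, content) tuple per post."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for heading in soup.select("h2.thisYear"):
        pubdate = None
        paragraphs = []
        # Walk the <h2>/<p> siblings after this heading; the next <h2>
        # marks the start of the following post, so stop there.
        for sibling in heading.find_next_siblings(["h2", "p"]):
            if sibling.name == "h2":
                break
            if "pubdate" in sibling.get("class", []):
                pubdate = sibling.get_text(strip=True)
            else:
                paragraphs.append(sibling.get_text(strip=True))
        title = heading.get_text(strip=True)
        rows.append((title, pubdate, "\n".join(paragraphs)))
    return rows


rows = posts_to_rows(html_doc)
df = pd.DataFrame(rows, columns=["Title", "Date", "Paragraphs"])
# Optional: parse the ISO-formatted date strings into real timestamps.
df["Date"] = pd.to_datetime(df["Date"])
print(df)
```

Walking forward yields one row per heading even when a post has no paragraphs, whereas grouping with find_previous only creates entries for posts that have at least one non-date paragraph. Which behavior you want depends on the page.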
