I'm new to Beautiful Soup and to web scraping in general. I'm trying to build a DataFrame with the title, content, and publish date of each post on a blog-style website (everything is on one page: a title, a publish date, and then the post's content). I can get the title and publish date easily enough, but I can't correctly pull the post's content. Each post is structured like so:
    <h2 class = "thisYear" title = "Click here to display/hide information"> "First Post Title" </h2>
    <p class ="pubdate" style="display: block;"> 2022-07-11</p>
    <p style="display: block;"> "First paragraph of post"</p>
    <p style="display: block;"> "Second paragraph of post"</p>
    <h2 class = "thisYear" title = "Click here to display/hide information"> "Second Post Title" </h2>
    <p class ="pubdate" style="display: block;"> 2022-07-07</p>
    <p style="display: block;"> "First paragraph of post"</p>
    <p style="display: block;"> "Second paragraph of post"</p>
Current Code:
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    # URL is the address of the page being scraped (not shown here)
    r = requests.get(URL, allow_redirects=True)
    soup = BeautifulSoup(r.content, 'html5lib')

    tag = 'p'
    title_class_name = "thisYear"
    news_class_name = "thisYear"
    date_class_name = "pubdate"

    df = pd.DataFrame()
    title_list = []
    news_list = []
    date_list = []

    title_table = soup.findAll('h2', attrs={'class': title_class_name})
    news_table = soup.findAll(tag, attrs={'class': None})  # every <p> without a class
    date_table = soup.findAll(tag, attrs={'class': date_class_name})

    for (title, news, date) in zip(title_table, news_table, date_table):
        title_list.append(title.text)
        news_list.append(news.text)
        date_list.append(date.text)

    df['title'] = title_list
    df['news'] = news_list
    df['publish_date'] = date_list
    df
I think I see the problem: each paragraph is being pulled as a separate news entry, but I haven't been able to correct that yet. How would I pull only the content that sits between one <h2 class="thisYear"> tag and the next?
Answer
You can use, for example, tag.find_previous to find which block each paragraph belongs to:
    from bs4 import BeautifulSoup

    html_doc = """
    <h2 class = "thisYear" title = "Click here to display/hide information"> "First Post Title" </h2>
    <p class ="pubdate" style="display: block;"> 2022-07-11</p>
    <p style="display: block;"> "First paragraph of post 1"</p>
    <p style="display: block;"> "Second paragraph of post 1"</p>
    <h2 class = "thisYear" title = "Click here to display/hide information"> "Second Post Title" </h2>
    <p class ="pubdate" style="display: block;"> 2022-07-07</p>
    <p style="display: block;"> "First paragraph of post 2"</p>
    <p style="display: block;"> "Second paragraph of post 2"</p>"""

    soup = BeautifulSoup(html_doc, "html.parser")

    out = {}
    # Select every <p> that follows an <h2 class="thisYear"> sibling, skipping the date paragraphs
    for p in soup.select("h2.thisYear ~ p:not(.pubdate)"):
        # The nearest preceding <h2> and .pubdate identify the post this paragraph belongs to
        title = p.find_previous("h2").text.strip()
        pubdate = p.find_previous(class_="pubdate").text.strip()
        out.setdefault((title, pubdate), []).append(p.text.strip())

    print(out)
Prints:
    {
        ('"First Post Title"', "2022-07-11"): [
            '"First paragraph of post 1"',
            '"Second paragraph of post 1"',
        ],
        ('"Second Post Title"', "2022-07-07"): [
            '"First paragraph of post 2"',
            '"Second paragraph of post 2"',
        ],
    }
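If the selector looks opaque, here is a minimal sketch of what each piece does on its own. The trimmed-down HTML and the names `matched` and `owners` are made up for illustration; only the selector and the `find_previous` call come from the snippet above.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page (hypothetical content, same structure)
html_doc = """
<h2 class="thisYear">Post A</h2>
<p class="pubdate">2022-07-11</p>
<p>Body A1</p>
<h2 class="thisYear">Post B</h2>
<p class="pubdate">2022-07-07</p>
<p>Body B1</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# "~" is the general sibling combinator: any <p> that comes after an
# <h2 class="thisYear"> under the same parent. ":not(.pubdate)" drops the dates.
body_paragraphs = soup.select("h2.thisYear ~ p:not(.pubdate)")
matched = [p.text for p in body_paragraphs]

# find_previous walks backwards through the document, so it returns the
# closest <h2> that appears before each paragraph.
owners = [p.find_previous("h2").text for p in body_paragraphs]

print(matched)
print(owners)
```

Note that the selector alone does not group anything: it returns one flat list of paragraphs. The grouping comes from looking up the nearest preceding heading for each paragraph.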
EDIT: To transform out into a DataFrame you can do:
    import pandas as pd

    df = pd.DataFrame(
        [
            # Join each post's paragraphs into one newline-separated string
            (title, date, "\n".join(paragraphs))
            for (title, date), paragraphs in out.items()
        ],
        columns=["Title", "Date", "Paragraphs"],
    )
    print(df)
Prints:
                     Title        Date                                                Paragraphs
    0   "First Post Title"  2022-07-11  "First paragraph of post 1"\n"Second paragraph of post 1"
    1  "Second Post Title"  2022-07-07  "First paragraph of post 2"\n"Second paragraph of post 2"
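Another way to read "everything between one h2 and the next" is to start at each heading and walk forward through its siblings until the next heading appears. The following is only a sketch under the assumption that the headings and paragraphs are all siblings under the same parent, as in the sample HTML; the cleaned-up sample markup and the `rows` list are illustrative, and the column names are borrowed from the question.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Cleaned-up copy of the sample structure (assumed to match the real page)
html_doc = """
<h2 class="thisYear">First Post Title</h2>
<p class="pubdate">2022-07-11</p>
<p>First paragraph of post 1</p>
<p>Second paragraph of post 1</p>
<h2 class="thisYear">Second Post Title</h2>
<p class="pubdate">2022-07-07</p>
<p>First paragraph of post 2</p>
<p>Second paragraph of post 2</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

rows = []
for h2 in soup.find_all("h2", class_="thisYear"):
    pubdate = None
    paragraphs = []
    # Only look at following <h2> and <p> siblings, in document order
    for sib in h2.find_next_siblings(["h2", "p"]):
        if sib.name == "h2":
            break  # the next post starts here
        if "pubdate" in sib.get("class", []):
            pubdate = sib.get_text(strip=True)
        else:
            paragraphs.append(sib.get_text(strip=True))
    rows.append(
        {
            "title": h2.get_text(strip=True),
            "publish_date": pubdate,
            "news": "\n".join(paragraphs),
        }
    )

df = pd.DataFrame(rows, columns=["title", "publish_date", "news"])
print(df)
```

This builds one row per heading, so a post with no body paragraphs still gets a row (with an empty news string), which the find_previous approach would skip.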