Python

I'm new to Beautiful Soup and web scraping in general. I'm trying to build a DataFrame that holds the title, content, and publish date of each post on a blog-style website (everything is on one page: a title, a publish date, and then the post's content). I can get the title and publish date easily enough, but I can't correctly pull the post's content. Each post is structured like so:

<h2 class = "thisYear" title = "Click here to display/hide information">
"First Post Title" </h2>
<p class ="pubdate" style="display: block;"> 2022-07-11</p>
<p style="display: block;"> "First paragraph of post"</p>
<p style="display: block;"> "Second paragraph of post"</p>
<h2 class = "thisYear" title = "Click here to display/hide information">
"Second Post Title" </h2>
<p class ="pubdate" style="display: block;"> 2022-07-07</p>
<p style="display: block;"> "First paragraph of post"</p>
<p style="display: block;"> "Second paragraph of post"</p>

Current Code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

r = requests.get(URL, allow_redirects=True)  # URL is the page being scraped
soup = BeautifulSoup(r.content, 'html5lib')
    
tag = 'p'
title_class_name = "thisYear"
news_class_name = "thisYear"
date_class_name = "pubdate"


df = pd.DataFrame()
title_list = []
news_list =[]
date_list = []

title_table = soup.findAll('h2',attrs= {'class':title_class_name})
news_table = soup.findAll(tag,attrs= {'class': None})
date_table = soup.findAll(tag,attrs= {'class':date_class_name})

for (title , news, date) in zip(title_table, news_table, date_table):
    title_list.append(title.text)
    news_list.append(news.text)
    date_list.append(date.text)
df['title'] = title_list
df['news']=news_list
df['publish_date']=date_list
df

I think I see the problem: it's pulling each paragraph as a separate news entry, but I haven't been able to correct that yet. How would I pull only the content that sits between one h2 tag with class 'thisYear' and the next?

Answer

You can use, for example, tag.find_previous to find which block each paragraph belongs to:

from bs4 import BeautifulSoup

html_doc = """
<h2 class = "thisYear" title = "Click here to display/hide information">
"First Post Title" </h2>
<p class ="pubdate" style="display: block;"> 2022-07-11</p>
<p style="display: block;"> "First paragraph of post 1"</p>
<p style="display: block;"> "Second paragraph of post 1"</p>
<h2 class = "thisYear" title = "Click here to display/hide information">
"Second Post Title" </h2>
<p class ="pubdate" style="display: block;"> 2022-07-07</p>
<p style="display: block;"> "First paragraph of post 2"</p>
<p style="display: block;"> "Second paragraph of post 2"</p>"""

soup = BeautifulSoup(html_doc, "html.parser")

out = {}
for p in soup.select("h2.thisYear ~ p:not(.pubdate)"):
    title = p.find_previous("h2").text.strip()
    pubdate = p.find_previous(class_="pubdate").text.strip()
    out.setdefault((title, pubdate), []).append(p.text.strip())

print(out)

Prints:

{
    ('"First Post Title"', "2022-07-11"): [
        '"First paragraph of post 1"',
        '"Second paragraph of post 1"',
    ],
    ('"Second Post Title"', "2022-07-07"): [
        '"First paragraph of post 2"',
        '"Second paragraph of post 2"',
    ],
}
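
As an alternative to looking backwards from each paragraph, the grouping can also be done forwards: start at each h2.thisYear heading and walk its following siblings until the next heading. The block below is only a sketch under the assumption that the headings and paragraphs are siblings, as in the sample HTML. The simplified html_doc and the posts list are illustrative names, not part of the original code.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page; the real markup may differ.
html_doc = """
<h2 class="thisYear">First Post Title</h2>
<p class="pubdate">2022-07-11</p>
<p>First paragraph of post 1</p>
<p>Second paragraph of post 1</p>
<h2 class="thisYear">Second Post Title</h2>
<p class="pubdate">2022-07-07</p>
<p>First paragraph of post 2</p>
<p>Second paragraph of post 2</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

posts = []
for h2 in soup.find_all("h2", class_="thisYear"):
    post = {"title": h2.get_text(strip=True), "date": None, "paragraphs": []}
    # Walk forward through sibling tags; the next <h2> starts a new post.
    for sib in h2.find_next_siblings():
        if sib.name == "h2":
            break
        if sib.name != "p":
            continue
        if "pubdate" in sib.get("class", []):
            post["date"] = sib.get_text(strip=True)
        else:
            post["paragraphs"].append(sib.get_text(strip=True))
    posts.append(post)

for post in posts:
    print(post)
```

Because each post is built as one dict, a post with no body paragraphs still produces an entry here, whereas the find_previous version only creates an entry once it sees a content paragraph.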

EDIT: To turn out into a DataFrame, you can do:

import pandas as pd


df = pd.DataFrame(
    [
        (title, date, "\n".join(paragraphs))
        for (title, date), paragraphs in out.items()
    ],
    columns=["Title", "Date", "Paragraphs"],
)
print(df)

Prints:

                 Title        Date                                                 Paragraphs
0   "First Post Title"  2022-07-11  "First paragraph of post 1"\n"Second paragraph of post 1"
1  "Second Post Title"  2022-07-07  "First paragraph of post 2"\n"Second paragraph of post 2"
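
If the dates need to be sortable and the literal quote characters from the sample are unwanted, a little cleanup can be done while building the frame. This is a sketch rather than a required step: the out dict below is a hand-written stand-in with the same shape as the one built earlier, and stripping the quotes assumes they are not meant to be kept.

```python
import pandas as pd

# Stand-in for the `out` dict built earlier: (title, date) -> list of paragraphs.
out = {
    ('"First Post Title"', "2022-07-11"): [
        '"First paragraph of post 1"',
        '"Second paragraph of post 1"',
    ],
    ('"Second Post Title"', "2022-07-07"): [
        '"First paragraph of post 2"',
        '"Second paragraph of post 2"',
    ],
}

rows = [
    {
        "Title": title.strip('"'),
        "Date": date,
        # Join paragraphs with a newline, dropping the surrounding quotes.
        "Paragraphs": "\n".join(p.strip('"') for p in paragraphs),
    }
    for (title, date), paragraphs in out.items()
]

df = pd.DataFrame(rows)
# Parse the date strings so the column can be sorted and filtered as dates.
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")
df = df.sort_values("Date").reset_index(drop=True)
print(df.dtypes)
```

After this, df["Date"] holds datetime values instead of strings, so the rows are ordered oldest first.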

Python & Beautiful Soup – Extract text between a specific tag and class combination
