I have the HTML below and am trying to understand how to extract the text of the <p> tags using Beautiful Soup (I am open to other methods). As you can see, the two <p> tags are not nested inside the same <div>. The approach I tried only seems to work when both <p> tags are within the same container.
<div class="top-panel">
  <div class="inside-panel-0">
    <h1 class="h1-title">Some Title</h1>
  </div>
  <div class="inside-panel-0">
    <div class="inside-panel-1">
      <p> I want to extract this copy</p>
    </div>
    <div class="inside-panel-1">
      <p>I want to extract this copy</p>
    </div>
  </div>
</div>
Answer
Each <p> tag sits inside a <div class="inside-panel-1">, so we can grab them all, even though they live in separate containers, by passing a CSS selector to the select method as follows:
from bs4 import BeautifulSoup
html = """
<div class="top-panel">
<div class="inside-panel-0">
<h1 class="h1-title">
Some Title
</h1>
</div>
<div class="inside-panel-0">
<div class="inside-panel-1">
<p>
I want to extract this copy
</p>
</div>
<div class="inside-panel-1">
<p>
I want to extract this copy
</p>
</div>
</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
# print(soup.prettify())
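# Match every <p> under an inside-panel-1 div, wherever that div sits inside top-panel
p_tags = soup.select('div.top-panel div.inside-panel-1 p')
for p_tag in p_tags:
    # strip=True trims the surrounding whitespace and newlines
    print(p_tag.get_text(strip=True))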
Output:
I want to extract this copy
I want to extract this copy
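If you would rather use find_all than a CSS selector, the same result can be sketched as below. This is only a minimal variant under the assumption that every <p> you want lives somewhere under the top-panel div; the names panel and texts are illustrative. find_all searches all descendants by default, so it does not matter that the <p> tags are in different inner containers.

```python
from bs4 import BeautifulSoup

html = """
<div class="top-panel">
  <div class="inside-panel-0">
    <h1 class="h1-title">Some Title</h1>
  </div>
  <div class="inside-panel-0">
    <div class="inside-panel-1">
      <p> I want to extract this copy</p>
    </div>
    <div class="inside-panel-1">
      <p>I want to extract this copy</p>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the outer container first, then search all of its descendants for <p> tags
panel = soup.find("div", class_="top-panel")
texts = [p.get_text(strip=True) for p in panel.find_all("p")]

print(texts)
# ['I want to extract this copy', 'I want to extract this copy']
```

Collecting the text into a list instead of printing inside the loop makes it easier to reuse the values later, for example to join them or write them to a file.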