I have the content below and I am trying to understand how to extract the <p> tag copy using Beautiful Soup (I am open to other methods). As you can see, the <p> tags are not both nested inside the same <div>. I gave it a shot with the following method, but that only seems to work when both <p> tags are within the same container.
<div class="top-panel">
  <div class="inside-panel-0">
    <h1 class="h1-title">Some Title</h1>
  </div>
  <div class="inside-panel-0">
    <div class="inside-panel-1">
      <p> I want to extract this copy</p>
    </div>
    <div class="inside-panel-1">
      <p>I want to extract this copy</p>
    </div>
  </div>
</div>
Answer
Both <p> tags sit inside a <div class="inside-panel-1"> element, so even though they live in separate containers, a single CSS selector passed to the select method matches both of them:
from bs4 import BeautifulSoup

html = """
<div class="top-panel">
  <div class="inside-panel-0">
    <h1 class="h1-title"> Some Title </h1>
  </div>
  <div class="inside-panel-0">
    <div class="inside-panel-1">
      <p> I want to extract this copy </p>
    </div>
    <div class="inside-panel-1">
      <p> I want to extract this copy </p>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
# print(soup.prettify())

# Match every <p> inside any div.inside-panel-1 under div.top-panel
p_tags = soup.select('div.top-panel div.inside-panel-1 p')
for p_tag in p_tags:
    # strip=True trims the surrounding whitespace from the text
    print(p_tag.get_text(strip=True))
Output:
I want to extract this copy
I want to extract this copy
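For comparison, here is a minimal sketch of the same extraction using find_all instead of a CSS selector. It assumes the markup from the question; the names panels and texts are only illustrative. It first collects every div with the class inside-panel-1, then gathers the <p> tags found in each one, so it does not matter that the paragraphs are in separate containers.

```python
from bs4 import BeautifulSoup

# Markup copied from the question
html = """
<div class="top-panel">
  <div class="inside-panel-0">
    <h1 class="h1-title">Some Title</h1>
  </div>
  <div class="inside-panel-0">
    <div class="inside-panel-1">
      <p> I want to extract this copy</p>
    </div>
    <div class="inside-panel-1">
      <p>I want to extract this copy</p>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Find each container div by class (class_ avoids the Python keyword "class")
panels = soup.find_all("div", class_="inside-panel-1")

# Collect the stripped text of every <p> inside each container
texts = [
    p.get_text(strip=True)
    for panel in panels
    for p in panel.find_all("p")
]

for text in texts:
    print(text)
```

If the class name on the containers were not reliable, calling soup.find_all("p") on a narrower parent element would be another option, at the cost of matching any other paragraphs in that parent.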