How to extract deeply nested tags using Beautiful Soup

I have the content below and I am trying to understand how to extract the text of the <p> tags using Beautiful Soup (I am open to other methods). As you can see, the two <p> tags are not nested inside the same <div>. I gave it a shot, but my approach only seems to work when both <p> tags are within the same container.

<div class="top-panel">
  <div class="inside-panel-0">
    <h1 class="h1-title">Some Title</h1>
  </div>
  <div class="inside-panel-0">
    <div class="inside-panel-1">
      <p> I want to extract this copy</p>
    </div>
    <div class="inside-panel-1">
      <p>I want to extract this copy</p>
    </div>
  </div>
</div>

Answer

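Because each <p> tag sits inside a <div class="inside-panel-1">, the class name is enough to locate them no matter which parent container they are in. We can grab them by calling the select method with a CSS selector, as follows: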

from bs4 import BeautifulSoup

html = """
<div class="top-panel">        
 <div class="inside-panel-0">  
  <h1 class="h1-title">        
   Some Title
  </h1>
 </div>
 <div class="inside-panel-0">  
  <div class="inside-panel-1"> 
   <p>
    I want to extract this copy
   </p>
  </div>
  <div class="inside-panel-1"> 
   <p>
    I want to extract this copy
   </p>
  </div>
 </div>
</div>

"""

soup = BeautifulSoup(html, 'html.parser')
# print(soup.prettify())

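# Select the <p> elements themselves, anywhere under an inside-panel-1 div
p_tags = soup.select('div.top-panel div.inside-panel-1 p')
for p_tag in p_tags:
    # strip=True trims the surrounding whitespace and newlines
    print(p_tag.get_text(strip=True))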

Output:

I want to extract this copy
I want to extract this copy
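
If you would rather not use CSS selectors, the same result can be reached with find_all. The snippet below is a minimal sketch that assumes the markup from the question. The variable name texts is only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="top-panel">
  <div class="inside-panel-0">
    <h1 class="h1-title">Some Title</h1>
  </div>
  <div class="inside-panel-0">
    <div class="inside-panel-1">
      <p> I want to extract this copy</p>
    </div>
    <div class="inside-panel-1">
      <p>I want to extract this copy</p>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all searches the whole tree, so it does not matter that the
# <p> tags live in different inside-panel-1 containers.
texts = [
    p.get_text(strip=True)
    for div in soup.find_all("div", class_="inside-panel-1")
    for p in div.find_all("p")
]

for text in texts:
    print(text)
```

Collecting the strings into a list first makes it easy to reuse them later instead of only printing them.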

Answer