I have the content below and I am trying to understand how to extract the copy from the <p> tags using Beautiful Soup (I am open to other methods). As you can see, the two <p> tags are not nested inside the same <div>. I gave it a shot, but my approach only seems to work when both <p> tags are within the same container.
<div class="top-panel">
    <div class="inside-panel-0">
        <h1 class="h1-title">Some Title</h1>
    </div>
    <div class="inside-panel-0">
        <div class="inside-panel-1">
            <p> I want to extract this copy</p>
        </div>
        <div class="inside-panel-1">
            <p>I want to extract this copy</p>
        </div>
    </div>
</div>
Answer
Since the <p> tags sit inside the <div> elements with class="inside-panel-1", we can grab them with a CSS selector passed to the select method, as follows:
from bs4 import BeautifulSoup

html = """
<div class="top-panel">
    <div class="inside-panel-0">
        <h1 class="h1-title">
            Some Title
        </h1>
    </div>
    <div class="inside-panel-0">
        <div class="inside-panel-1">
            <p>
                I want to extract this copy
            </p>
        </div>
        <div class="inside-panel-1">
            <p>
                I want to extract this copy
            </p>
        </div>
    </div>
</div>

"""

soup = BeautifulSoup(html, 'html.parser')
# print(soup.prettify())

# Match every <p> under a div.inside-panel-1, regardless of which container holds it
p_tags = soup.select('div.top-panel div.inside-panel-1 p')
for p_tag in p_tags:
    # strip=True removes the surrounding whitespace and newlines
    print(p_tag.get_text(strip=True))
Output:
I want to extract this copy
I want to extract this copy
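
For comparison, here is a minimal sketch of the same extraction done with find_all instead of a CSS selector. It walks every div with class inside-panel-1 and collects the <p> text from each one, so it does not matter that the two <p> tags live in separate containers. The helper name extract_copy is made up for this illustration, and the HTML is the snippet from the question.

```python
from bs4 import BeautifulSoup

html = """
<div class="top-panel">
    <div class="inside-panel-0">
        <h1 class="h1-title">Some Title</h1>
    </div>
    <div class="inside-panel-0">
        <div class="inside-panel-1">
            <p> I want to extract this copy</p>
        </div>
        <div class="inside-panel-1">
            <p>I want to extract this copy</p>
        </div>
    </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def extract_copy(soup):
    """Hypothetical helper: return the stripped text of every <p>
    found inside any <div class="inside-panel-1">."""
    texts = []
    # find_all returns every matching container, not just the first one
    for panel in soup.find_all("div", class_="inside-panel-1"):
        for p_tag in panel.find_all("p"):
            texts.append(p_tag.get_text(strip=True))
    return texts


copy_texts = extract_copy(soup)
print(copy_texts)
# ['I want to extract this copy', 'I want to extract this copy']
```

If only the first container comes back, a likely cause is calling find (which returns a single element) where find_all is needed.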