I have an HTML document that looks similar to this:
<div class='product'>
  <table>
    <tr>
      random stuff here
    </tr>
    <tr class='line1'>
      <td class='row'>
        <span>TEXT I NEED</span>
      </td>
    </tr>
    <tr class='line2'>
      <td class='row'>
        <span>MORE TEXT I NEED</span>
      </td>
    </tr>
    <tr class='line3'>
      <td class='row'>
        <span>EVEN MORE TEXT I NEED</span>
      </td>
    </tr>
  </table>
</div>
I used this code, but it also returns the text from the first tr, which has no class, and I need to ignore that row:
soup.findAll('tr').text
Also, when I try to match on just a class, this doesn’t seem to be valid Python:
soup.findAll('tr', {'class'})
I would like some help extracting the text.
Answer
To get the desired output, use a CSS selector that excludes the first <tr> tag and selects the rest:
from bs4 import BeautifulSoup

# html is assumed to hold the markup from the question as a string
soup = BeautifulSoup(html, 'html.parser')

# Select every <tr> under .product except the first one
for tag in soup.select('.product tr:not(.product tr:nth-of-type(1))'):
    print(tag.text.strip())
Output:
TEXT I NEED
MORE TEXT I NEED
EVEN MORE TEXT I NEED
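Since the question says the unwanted row is the one without a class, another option may be to keep only the rows that carry a class attribute instead of skipping by position. The following is a minimal sketch of that idea, not the only way to do it. The `html` string is an assumed copy of the sample markup from the question, and `texts` and `texts_css` are names made up for illustration.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, copied from the structure shown in the question.
html = """
<div class='product'>
  <table>
    <tr>
      random stuff here
    </tr>
    <tr class='line1'>
      <td class='row'>
        <span>TEXT I NEED</span>
      </td>
    </tr>
    <tr class='line2'>
      <td class='row'>
        <span>MORE TEXT I NEED</span>
      </td>
    </tr>
    <tr class='line3'>
      <td class='row'>
        <span>EVEN MORE TEXT I NEED</span>
      </td>
    </tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all returns a list-like ResultSet, so the text has to be read
# from each tag in turn rather than from the result as a whole.
# class_=True matches any <tr> that has a class attribute, whatever its value.
texts = [tr.get_text(strip=True) for tr in soup.find_all('tr', class_=True)]

# The same filter written as a CSS attribute selector.
texts_css = [tr.get_text(strip=True) for tr in soup.select('.product tr[class]')]

for text in texts:
    print(text)
```

This variant depends on the class attribute rather than on row order, so it would keep working if the unclassed row moved, but it would also pick up any other classed rows added later.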