I have searched but can't find an answer to my question: I want to get the main content of a cell while ignoring the commented-out content. How should I do this?
<td>
<!--
<i class="fab fa-youtube" aria-hidden="true" style="color: #f00;"></i>
-->
main content
</td>
My Scrapy spider looks like this:
'name': row.xpath('td[2]/text()').get()
But this code gives me only whitespace such as \n and \t. Please help, thank you.
Answer
When /text() in XPath or ::text in CSS fails to produce the desired result, I use another library, html2text. To install it:
pip3 install html2text
from html2text import HTML2Text

h = HTML2Text()
h.ignore_links = True
h.ignore_images = True
h.ignore_emphasis = True

# Inside the Scrapy project
name = h.handle(row.xpath('td[2]').get()).strip()

yield .
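
To see why /text() with .get() returns only whitespace here, the following is a minimal sketch that uses only Python's standard-library html.parser instead of Scrapy or html2text (a swapped-in technique, purely for illustration). The names HTML, TextCollector and visible_text are hypothetical. The parser hands comments to a separate handler, so they never reach the collected text, and the cell turns out to contain two text nodes, the first of which is just a newline.

```python
from html.parser import HTMLParser

# The cell from the question, used as sample input.
HTML = """<td>
<!--
<i class="fab fa-youtube" aria-hidden="true" style="color: #f00;"></i>
-->
main content
</td>"""


class TextCollector(HTMLParser):
    """Collects text nodes only; comments go to handle_comment, left as a no-op."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags and comments.
        self.chunks.append(data)


def visible_text(html):
    """Join all text nodes and trim the surrounding whitespace."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


collector = TextCollector()
collector.feed(HTML)
collector.close()
text_nodes = collector.chunks  # the first entry is whitespace only

print(repr(text_nodes[0]))
print(visible_text(HTML))
```

Inside a spider, the same idea can probably be expressed without an extra library by joining every text node rather than taking the first one, for example ''.join(row.xpath('td[2]/text()').getall()).strip(), since text() does not select comment nodes. Treat that as a suggestion to verify against your own page.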