I searched but couldn't find an answer to my question: I want to get the main content of a table cell while ignoring the commented-out content. How should I do this?
<td> <!-- <i class="fab fa-youtube" aria-hidden="true" style="color: #f00;"></i> --> main content </td>
My Scrapy spider looks like this:
'name': row.xpath('td[2]/text()').get()
But this code gives me only some whitespace (\n\t). Please help, thank you.
Answer
When /text() in XPath or ::text in CSS fails to produce the desired result, I use another library, html2text. To install it:
pip3 install html2text
from html2text import HTML2Text

h = HTML2Text()
h.ignore_links = True
h.ignore_images = True
h.ignore_emphasis = True

# Inside the Scrapy project
name = h.handle(row.xpath('td[2]').get()).strip()
yield ....
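
The reason td[2]/text() with .get() returns only whitespace is that the cell contains two text nodes, one before the comment and one after it, and .get() returns just the first. If you would rather not add a dependency, here is a minimal sketch of the same idea using only the standard library. The names TextOnlyParser and visible_text are made up for this illustration and are not part of Scrapy or html2text; the sketch collects every text node and skips comments. Inside a spider, joining the results of row.xpath('td[2]/text()').getall() and stripping the result should have a similar effect.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores everything else.

    Comments are dropped because handle_comment is not overridden
    (the default implementation does nothing).
    """

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags and comments.
        self.chunks.append(data)


def visible_text(html):
    """Return the concatenated text of an HTML fragment, without comments."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.chunks).strip()


# The cell from the question
cell = (
    '<td> <!-- <i class="fab fa-youtube" aria-hidden="true" '
    'style="color: #f00;"></i> --> main content </td>'
)
print(visible_text(cell))  # prints: main content
```

In the spider this could be called as visible_text(row.xpath('td[2]').get()), assuming the cell is always present; .get() returns None when nothing matches, so a missing cell would need a guard.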