Scrapy: get only the text, ignoring the commented-out content

I researched this but couldn't find an answer to my question: I want to get the main content and ignore the commented-out content. How should I do that?

<td>
<!--
  <i class="fab fa-youtube" aria-hidden="true" style="color: #f00;"></i>                                      
-->
main content
</td>

My Scrapy spider looks like this:

'name': row.xpath('td[2]/text()').get()

But this code gives me only some whitespace (\n\t). Please help, thank you.

Answer

When /text() in XPath or ::text in CSS fails to produce the desired result, I use another library, html2text, which converts an HTML fragment to plain text and leaves out HTML comments.

To install it:

pip3 install html2text

from html2text import HTML2Text
h = HTML2Text()
h.ignore_links = True
h.ignore_images = True
h.ignore_emphasis = True

# Inside the Scrapy project, where row is the row selector from the question
name = h.handle(row.xpath('td[2]').get()).strip()

yield ....

User contributions licensed under: CC BY-SA