I am trying to crawl a website searching for all JS files in order to download them. I am new to Scrapy, and I have found that I can use CrawlSpider, but it seems I have an issue with LinkExtractor, as my parser is never executed.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class JSDownloader(CrawlSpider):
    name = 'jsdownloader'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    rules = (
        Rule(LinkExtractor(allow=(r'\.js',)), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('JS File %s', response.url)
        item = scrapy.Item()
        # Process item here
        yield item
Answer
I found that LinkExtractor has tags and attrs parameters, whose defaults cover only the 'a' and 'area' tags (and the href attribute). See the LinkExtractor documentation.
So the solution is to add the 'script' tag (and the src attribute):
Rule(LinkExtractor(tags=('a', 'area', 'script'), attrs=('href', 'src')), callback='parse_item'),
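For completeness, here is a minimal sketch of the full spider with the corrected rule. It reuses the placeholder example.com URLs from the question, and the deny_extensions=[] argument is an extra precaution I add in case your Scrapy version's default IGNORED_EXTENSIONS filtering drops .js links; the file-saving logic in parse_item is just one way to store the downloaded scripts:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class JSDownloader(CrawlSpider):
    name = 'jsdownloader'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    rules = (
        # Scan <script src="..."> in addition to the default
        # <a href="..."> and <area href="..."> tags.
        Rule(
            LinkExtractor(
                allow=(r'\.js',),
                tags=('a', 'area', 'script'),
                attrs=('href', 'src'),
                deny_extensions=[],  # precaution: don't filter links by file extension
            ),
            callback='parse_item',
        ),
    )

    def parse_item(self, response):
        # The response body is the raw JS file; log the URL and save it to disk.
        self.logger.info('JS File %s', response.url)
        filename = response.url.split('/')[-1] or 'index.js'
        with open(filename, 'wb') as f:
            f.write(response.body)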