I am trying to crawl a website, looking for all JS files so I can download them. I am new to Scrapy, and I found that I can use CrawlSpider, but I seem to have an issue with LinkExtractor, as my parse callback is never executed.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class JSDownloader(CrawlSpider):
    name = 'jsdownloader'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    rules = (
        Rule(LinkExtractor(allow=('.js', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('JS File %s', response.url)
        item = scrapy.Item()
        # Process Item here
        yield item
Answer
I found that LinkExtractor has tags and attrs parameters, and their defaults only cover the 'a' and 'area' tags and the href attribute. See the LinkExtractor documentation.
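For reference, the relevant part of the constructor's defaults looks roughly like this (a sketch, not the full signature):

LinkExtractor(
    tags=('a', 'area'),    # only <a> and <area> elements are scanned
    attrs=('href',),       # only the href attribute is read
    deny_extensions=None,  # None falls back to Scrapy's IGNORED_EXTENSIONS list
)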
So the solution is to add the 'script' tag (and the src attribute):
Rule(LinkExtractor(tags=('a', 'script'), attrs=('href', 'src')), callback='parse_item'),
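Putting it together, here is a minimal end-to-end sketch. The save-to-disk logic and the ./js/ output directory are my own choices for illustration, and the deny_extensions override is defensive: LinkExtractor silently drops links whose file extension is on its deny list, so removing 'js' from that list (if your Scrapy version's IGNORED_EXTENSIONS includes it) makes sure .js URLs get through. The .js rule is listed first because CrawlSpider hands each link to the first rule whose extractor matches it:

import os

from scrapy.linkextractors import IGNORED_EXTENSIONS, LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class JSDownloader(CrawlSpider):
    name = 'jsdownloader'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    rules = (
        # .js links first: CrawlSpider deduplicates links across rules,
        # so this rule must see them before the generic follow rule does.
        Rule(
            LinkExtractor(
                allow=(r'\.js$',),
                tags=('a', 'script'),
                attrs=('href', 'src'),
                deny_extensions=[e for e in IGNORED_EXTENSIONS if e != 'js'],
            ),
            callback='parse_item',
        ),
        # Keep crawling ordinary page links to find more <script> tags.
        Rule(LinkExtractor(), follow=True),
    )

    def parse_item(self, response):
        self.logger.info('JS File %s', response.url)
        # Save the body under ./js/, named after the last URL segment.
        os.makedirs('js', exist_ok=True)
        filename = response.url.split('/')[-1] or 'unnamed.js'
        with open(os.path.join('js', filename), 'wb') as f:
            f.write(response.body)

For a real project, Scrapy's built-in FilesPipeline is the more idiomatic way to persist downloaded files, but writing the response body directly keeps the example self-contained.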