Downloading all JS files using Scrapy?

I am trying to crawl a website to find and download all of its JS files. I am new to Scrapy and found that I can use CrawlSpider, but it seems I have an issue with LinkExtractor, because my parse callback is never executed.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class JSDownloader(CrawlSpider):
    name = 'jsdownloader'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    rules = (
        Rule(LinkExtractor(allow=(r'\.js',)), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('JS File %s', response.url)
        item = scrapy.Item()
        # Process item here
        yield item


Answer

I found that LinkExtractor has `tags` and `attrs` parameters, and by default it only looks at 'a' and 'area' tags (and their 'href' attribute). See the LinkExtractor documentation.

So the solution is to also extract from the 'script' tag's 'src' attribute:

Rule(LinkExtractor(tags=('a', 'script'), attrs=('href', 'src')), callback='parse_item'),
User contributions licensed under: CC BY-SA