I am trying to crawl a website searching for all JS files to download them. I am new to Scrapy, and I have found that I can use CrawlSpider, but it seems I have an issue with LinkExtractor, as my parse callback is never executed.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class JSDownloader(CrawlSpider):
    name = 'jsdownloader'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    rules = (
        Rule(LinkExtractor(allow=('.js', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('JS File %s', response.url)
        item = scrapy.Item()
        # Process item here
        yield item
Answer
I found that LinkExtractor has tags and attrs parameters, whose defaults cover only the 'a' and 'area' tags (see the LinkExtractor documentation).
So the solution is to add the 'script' tag:
Rule(LinkExtractor(tags=('a', 'script'), attrs=('href', 'src')), callback='parse_item'),