How can I use scrapy middlewares to call a mail function?

I have 15 spiders, and every spider has its own content to send by mail. Each spider also has its own spider_closed method that starts the mail sender, but all of them are the same. At some point the spider count will reach 100, and I don't want to repeat the same function over and over. That is why I am trying to use middlewares: I have been trying to handle spider_closed in a middleware, but it doesn't work.

middlewares.py

class FirstBotSpiderMiddleware:

    def __init__(self, spider):
        self.spider = spider
    
    @classmethod
    def from_crawler(cls, crawler):
        print("crawler works")
        crawler.signals.connect(crawler.spider.spider_closed, signal=signals.spider_opened)
        return cls(crawler)

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
    
    def spider_closed(self, spider, reason):
        print("Spider closed works")
        if reason == "finished":
            content = spider.name + " works.\n"
        elif reason == "SD":
            content = spider.name + " works but same.\n"
        else:
            content = spider.name + " error"
        
        self.mailSender(spider,content)


    def mailSender(self, location, content):
        print("mailSender works")
        mailer = MailSender()
        mailer = MailSender.from_settings(settings)
        mailer.send(to=["Some Mail"], subject=content, body="Some body")

settings.py

# Scrapy settings for first_bot project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import os
import sys



sys.path.append(os.path.dirname(os.path.abspath('.')))

BOT_NAME = 'first_bot'

SPIDER_MODULES = ['first_bot.spiders']
NEWSPIDER_MODULE = 'first_bot.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"


# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 2

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'first_bot.middlewares.FirstBotSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'first_bot.middlewares.FirstBotDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html

#ITEM_PIPELINES = {
#    'first_bot.pipelines.FirstBotPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

I am not getting any error or any mail. I also added some print statements, and there is no output.

How can I run middlewares with my spiders? What are your suggestions?


Answer

It is important to run the spider with the scrapy crawl command, so that the whole project configuration, including settings.py, is picked up correctly. You also need to make sure the custom middleware is listed in the SPIDER_MIDDLEWARES dict with an order number, which you have done. The main entry point for a middleware is the from_crawler class method, which receives the crawler instance; there you connect your own handlers to the signals you need and write the processing logic, following the rules described at https://docs.scrapy.org/en/latest/topics/spider-middleware.html.
