Install Scrapy on Windows Server 2019, running in a Docker container

I want to install Scrapy on Windows Server 2019, running in a Docker container (please see here and here for the history of my installation).

On my local Windows 10 machine I can run my Scrapy commands like so in Windows PowerShell (after simply starting Docker Desktop): scrapy crawl myscraper -o allobjects.json in folder C:\scrapy\my1stscraper

For Windows Server, as recommended here, I first installed Anaconda following these steps: https://docs.scrapy.org/en/latest/intro/install.html.

I then opened the Anaconda prompt and typed conda install -c conda-forge scrapy in D:\Programs


In PowerShell on my VPS I then tried to run Scrapy via D:\Programs\Anaconda3\Scripts\scrapy.exe

I want to run the spider I have stored in folder D:\scrapy\my1stscraper.

The Docker Engine service is running as a Windows Service (presuming I don’t need to explicitly start a container when running my scrapy command; if I do, I would not know how).

I tried starting my scraper like so: D:\Programs\Anaconda3\Scripts\scrapy.exe crawl D:\scrapy\my1stscraper\spiders\my1stscraper -o allobjects.json, resulting in errors:

(full error output not preserved in this copy of the post)

I checked this error here: "from lxml import etree" raising "ImportError: DLL load failed: The specified module could not be found"
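
For completeness: in a conda environment, this DLL error usually means lxml's native libraries (libxml2/libxslt) are missing or mismatched. One common fix, not from the original post and offered only as a suggestion, is to force-reinstall lxml from conda-forge so the matching DLLs land in the environment:

```
# Suggested fix (assumption, not from the original post):
# reinstall lxml and its bundled native DLLs from conda-forge
conda install -c conda-forge --force-reinstall lxml
```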

That question talks about pip, which I did not use, but to be sure I did install the C++ build tools anyway.

I still get the same error. How can I run my Scrapy crawler in the Docker container?

UPDATE 1

My VPS is my only environment, so I am not sure how to test in a virtual environment.

What I did now:

Looking at your recommendations:

Get steps to manually install the app on Windows Server – ideally test in a virtualised environment so you can reset it cleanly

  1. When you say app, what do you mean? Scrapy? Conda?

Convert all steps to a fully automatic powershell script (e.g. for conda, need to download the installer via wget, execute the installer etc.)

  1. I now installed Conda on the host OS, since I thought that would give me the least overhead. Or would you install it in the image directly, and if so, how do I avoid having to install it each time?

  2. Lastly, just to be sure: I want to run multiple Scrapy scrapers with as little overhead as possible. Should I just repeat the RUN command in the SAME docker container for each scraper I want to execute?

UPDATE 2

whoami indeed returns user manager\containeradministrator

scrapy bench returns:

(benchmark output not preserved in this copy of the post)

I have the Scrapy project I want to run in folder D:\scrapy\my1stscraper. How can I run that project, since the D: drive is not available within my container?

UPDATE 3

A few months after we discussed this, when I now run your proposed Dockerfile it breaks, and I get this output:

(error output not preserved in this copy of the post)

I'm not sure if I’m reading this correctly, but it seems as if Scrapy does not support Python 3.9, even though here I see “Scrapy requires Python 3.6+”: https://docs.scrapy.org/en/latest/intro/install.html. Do you know what’s causing this issue? I also checked here, but there is no answer yet either.


Answer

To run a containerised app, it must be installed in a container image first – you don’t want to install any software on the host machine.

For Linux there are off-the-shelf container images for almost everything, which is probably what your Docker Desktop environment was using; I see 1051 results on Docker Hub when searching for scrapy, but none of them are Windows containers.

The full process of creating a Windows container from scratch for an app is:

  • Get steps to manually install the app (scrapy and its dependencies) on Windows Server – ideally test in a virtualised environment so you can reset it cleanly
  • Convert all steps to a fully automatic powershell script (e.g. for conda, need to download the installer via wget, execute the installer etc.)
  • Optionally, test the powershell steps in an interactive container
    • docker run -it --isolation=process mcr.microsoft.com/windows/servercore:ltsc2019 powershell
    • This runs a windows container and gives you a shell to verify that your install script works
    • When you exit the shell the container is stopped
  • Create a Dockerfile
    • Use mcr.microsoft.com/windows/servercore:ltsc2019 as the base image via FROM
    • Use the RUN command for each line of your powershell script
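
As a sketch, the "fully automatic powershell script" step might look like the following. The installer URL, install path, and silent-install switches are assumptions based on Miniconda's documented installer options, not taken from the original answer:

```
# Download the Miniconda installer (wget is an alias for Invoke-WebRequest in PowerShell)
Invoke-WebRequest -Uri https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe -OutFile miniconda.exe
# Run the installer silently into C:\Miniconda3 (/S = silent, /D= = target directory)
Start-Process .\miniconda.exe -ArgumentList '/S', '/D=C:\Miniconda3' -Wait
# Install scrapy into the base environment from conda-forge
C:\Miniconda3\Scripts\conda.exe install -y -c conda-forge scrapy
```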

I tried installing scrapy in an existing Windows Dockerfile that used conda / Python 3.6; it threw the error SettingsFrame has no attribute 'ENABLE_CONNECT_PROTOCOL' at a similar stage.

However, I tried again with Miniconda and Python 3.8 and was able to get scrapy running; here’s the Dockerfile:

(the Dockerfile code block was not preserved in this copy of the post)
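
Based on the description above (Windows Server Core base image, Miniconda, Python 3.8, scrapy from conda-forge), a reconstruction might look roughly like this; treat every URL, path, and version pin as an assumption rather than the original answer's exact content:

```
# escape=`
# Assumed reconstruction, not the original answer's exact Dockerfile
FROM mcr.microsoft.com/windows/servercore:ltsc2019
SHELL ["powershell", "-Command"]
# Download and silently install Miniconda (URL and switches are assumptions)
RUN Invoke-WebRequest -Uri https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe -OutFile miniconda.exe; `
    Start-Process .\miniconda.exe -ArgumentList '/S', '/D=C:\Miniconda3' -Wait; `
    Remove-Item miniconda.exe
# Pin Python 3.8 (3.9 caused the failure described in UPDATE 3) and install scrapy
RUN C:\Miniconda3\Scripts\conda.exe install -y -c conda-forge python=3.8 scrapy
CMD ["powershell"]
```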

Build it with docker build . -t scrapy and run with docker run -it scrapy.

To verify you are running a shell inside the container, run whoami – it should return user manager\containeradministrator.

Then the scrapy bench command should be able to run and dump some stats. The container will stop when you close the shell.
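
The question from UPDATE 2 (the project living on the host's D: drive) can be handled with a bind mount at run time; the image tag, mount target, and the assumption that scrapy is on the container's PATH are all mine, not from the original answer:

```
# Mount the host folder D:\scrapy\my1stscraper into the container as C:\my1stscraper
docker run -it -v D:\scrapy\my1stscraper:C:\my1stscraper scrapy powershell

# Then, inside the container shell:
cd C:\my1stscraper
scrapy crawl my1stscraper -o allobjects.json
```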

User contributions licensed under: CC BY-SA