
Install Scrapy on Windows Server 2019, running in a Docker container

I want to install Scrapy on Windows Server 2019, running in a Docker container (please see here and here for the history of my installation).

On my local Windows 10 machine I can run my Scrapy commands like so in Windows PowerShell (after simply starting Docker Desktop): scrapy crawl myscraper -o allobjects.json in folder C:\scrapy\my1stscraper

For Windows Server, as recommended here, I first installed Anaconda following these steps: https://docs.scrapy.org/en/latest/intro/install.html.

I then opened the Anaconda prompt and typed conda install -c conda-forge scrapy in D:\Programs

(base) PS D:\Programs> dir


    Directory: D:\Programs


Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d-----        4/22/2021  10:52 AM                Anaconda3
-a----        4/22/2021  11:20 AM              0 conda


(base) PS D:\Programs> conda install -c conda-forge scrapy
Collecting package metadata (current_repodata.json): done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
  current version: 4.9.2
  latest version: 4.10.1

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: D:\Programs\Anaconda3

  added / updated specs:
    - scrapy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    automat-20.2.0             |             py_0          30 KB  conda-forge
    conda-4.10.1               |   py38haa244fe_0         3.1 MB  conda-forge
    constantly-15.1.0          |             py_0           9 KB  conda-forge
    cssselect-1.1.0            |             py_0          18 KB  conda-forge
    hyperlink-21.0.0           |     pyhd3deb0d_0          71 KB  conda-forge
    incremental-17.5.0         |             py_0          14 KB  conda-forge
    itemadapter-0.2.0          |     pyhd8ed1ab_0          12 KB  conda-forge
    parsel-1.6.0               |             py_0          15 KB  conda-forge
    pyasn1-0.4.8               |             py_0          53 KB  conda-forge
    pyasn1-modules-0.2.7       |             py_0          60 KB  conda-forge
    pydispatcher-2.0.5         |             py_1          12 KB  conda-forge
    pyhamcrest-2.0.2           |             py_0          29 KB  conda-forge
    python_abi-3.8             |           1_cp38           4 KB  conda-forge
    queuelib-1.6.1             |     pyhd8ed1ab_0          14 KB  conda-forge
    scrapy-2.4.1               |   py38haa95532_0         372 KB
    service_identity-18.1.0    |             py_0          12 KB  conda-forge
    twisted-21.2.0             |   py38h294d835_0         5.1 MB  conda-forge
    twisted-iocpsupport-1.0.1  |   py38h294d835_0          49 KB  conda-forge
    w3lib-1.22.0               |     pyh9f0ad1d_0          21 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         9.0 MB

The following NEW packages will be INSTALLED:

  automat            conda-forge/noarch::automat-20.2.0-py_0
  constantly         conda-forge/noarch::constantly-15.1.0-py_0
  cssselect          conda-forge/noarch::cssselect-1.1.0-py_0
  hyperlink          conda-forge/noarch::hyperlink-21.0.0-pyhd3deb0d_0
  incremental        conda-forge/noarch::incremental-17.5.0-py_0
  itemadapter        conda-forge/noarch::itemadapter-0.2.0-pyhd8ed1ab_0
  parsel             conda-forge/noarch::parsel-1.6.0-py_0
  pyasn1             conda-forge/noarch::pyasn1-0.4.8-py_0
  pyasn1-modules     conda-forge/noarch::pyasn1-modules-0.2.7-py_0
  pydispatcher       conda-forge/noarch::pydispatcher-2.0.5-py_1
  pyhamcrest         conda-forge/noarch::pyhamcrest-2.0.2-py_0
  python_abi         conda-forge/win-64::python_abi-3.8-1_cp38
  queuelib           conda-forge/noarch::queuelib-1.6.1-pyhd8ed1ab_0
  scrapy             pkgs/main/win-64::scrapy-2.4.1-py38haa95532_0
  service_identity   conda-forge/noarch::service_identity-18.1.0-py_0
  twisted            conda-forge/win-64::twisted-21.2.0-py38h294d835_0
  twisted-iocpsuppo~ conda-forge/win-64::twisted-iocpsupport-1.0.1-py38h294d835_0
  w3lib              conda-forge/noarch::w3lib-1.22.0-pyh9f0ad1d_0

The following packages will be UPDATED:

  conda               pkgs/main::conda-4.9.2-py38haa95532_0 --> conda-forge::conda-4.10.1-py38haa244fe_0


Proceed ([y]/n)? y


Downloading and Extracting Packages
constantly-15.1.0    | 9 KB      | ############################################################################ | 100%
itemadapter-0.2.0    | 12 KB     | ############################################################################ | 100%
twisted-21.2.0       | 5.1 MB    | ############################################################################ | 100%
pydispatcher-2.0.5   | 12 KB     | ############################################################################ | 100%
queuelib-1.6.1       | 14 KB     | ############################################################################ | 100%
service_identity-18. | 12 KB     | ############################################################################ | 100%
pyhamcrest-2.0.2     | 29 KB     | ############################################################################ | 100%
cssselect-1.1.0      | 18 KB     | ############################################################################ | 100%
automat-20.2.0       | 30 KB     | ############################################################################ | 100%
pyasn1-0.4.8         | 53 KB     | ############################################################################ | 100%
twisted-iocpsupport- | 49 KB     | ############################################################################ | 100%
python_abi-3.8       | 4 KB      | ############################################################################ | 100%
hyperlink-21.0.0     | 71 KB     | ############################################################################ | 100%
conda-4.10.1         | 3.1 MB    | ############################################################################ | 100%
scrapy-2.4.1         | 372 KB    | ############################################################################ | 100%
incremental-17.5.0   | 14 KB     | ############################################################################ | 100%
w3lib-1.22.0         | 21 KB     | ############################################################################ | 100%
pyasn1-modules-0.2.7 | 60 KB     | ############################################################################ | 100%
parsel-1.6.0         | 15 KB     | ############################################################################ | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
(base) PS D:\Programs>

In PowerShell on my VPS I then tried to run scrapy via D:\Programs\Anaconda3\Scripts\scrapy.exe

I want to run the spider I have stored in folder D:\scrapy\my1stscraper.

The Docker Engine service is running as a Windows service (presuming I don’t need to explicitly start a container when running my scrapy command; if I do, I would not know how).

I tried starting my scraper like so: D:\Programs\Anaconda3\Scripts\scrapy.exe crawl D:\scrapy\my1stscraper\spiders\my1stscraper -o allobjects.json, resulting in errors:

Traceback (most recent call last):
  File "D:\Programs\Anaconda3\Scripts\scrapy-script.py", line 6, in <module>
    from scrapy.cmdline import execute
  File "D:\Programs\Anaconda3\lib\site-packages\scrapy\__init__.py", line 12, in <module>
    from scrapy.spiders import Spider
  File "D:\Programs\Anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 11, in <module>
    from scrapy.http import Request
  File "D:\Programs\Anaconda3\lib\site-packages\scrapy\http\__init__.py", line 11, in <module>
    from scrapy.http.request.form import FormRequest
  File "D:\Programs\Anaconda3\lib\site-packages\scrapy\http\request\form.py", line 10, in <module>
    import lxml.html
  File "D:\Programs\Anaconda3\lib\site-packages\lxml\html\__init__.py", line 53, in <module>
    from .. import etree
ImportError: DLL load failed while importing etree: The specified module could not be found.

I checked here: “from lxml import etree ImportError: DLL load failed: The specified module could not be found”.

That question talks about pip, which I did not use, but to be sure I installed the C++ build tools anyway.

I still get the same error. How can I run my Scrapy crawler in the Docker container?

UPDATE 1

My VPS is my only environment, so I’m not sure how to test in a virtual environment.

What I did now:

Looking at your recommendations:

Get steps to manually install the app on Windows Server – ideally test in a virtualised environment so you can reset it cleanly

  1. When you say app, what do you mean? Scrapy? Conda?

Convert all steps to a fully automatic PowerShell script (e.g. for conda, you need to download the installer via wget, execute the installer, etc.)

  1. I now installed Conda on the host OS, since I thought that would give me the least amount of overhead. Or would you install it in the image directly, and if so, how do I avoid installing it each time?

  2. Lastly, just to check: I want to run multiple Scrapy scrapers, but with as little overhead as possible. I should just repeat the RUN command in the SAME Docker container for each scraper I want to execute, correct?

UPDATE 2

whoami indeed returns user manager\containeradministrator

scrapy benchmark returns

Scrapy 2.4.1 - no active project
Unknown command: benchmark
Use "scrapy" to see available commands

I have the Scrapy project I want to run in folder D:\scrapy\my1stscraper. How can I run that project, since the D: drive is not available within my container?

UPDATE 3

A few months later, when we discussed this again, I ran your proposed Dockerfile; it now breaks with this output:

PS D:Programs> docker build . -t scrapy
Sending build context to Docker daemon  1.644GB
Step 1/9 : FROM mcr.microsoft.com/windows/servercore:ltsc2019
 ---> d1724c2d9a84
Step 2/9 : SHELL ["powershell", "-Command", "$ErrorActionPreference = 'Stop'; $ProgressPreference = 'SilentlyContinue';"]
 ---> Running in 5f79f1bf9b62
Removing intermediate container 5f79f1bf9b62
 ---> 8bb2a477eaca
Step 3/9 : RUN setx /M PATH $('C:\Users\ContainerAdministrator\miniconda3\Library\bin;C:\Users\ContainerAdministrator\miniconda3\Scripts;C:\Users\ContainerAdministrator\miniconda3;' + $Env:PATH)
 ---> Running in f3869c4f64d5

SUCCESS: Specified value was saved.
Removing intermediate container f3869c4f64d5
 ---> 82a2fa969a88
Step 4/9 : RUN Invoke-WebRequest "https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe" -OutFile miniconda3.exe -UseBasicParsing;     Start-Process -FilePath 'miniconda3.exe' -Wait -ArgumentList '/S', '/D=C:\Users\ContainerAdministrator\miniconda3';     Remove-Item .\miniconda3.exe;     conda install -y -c conda-forge scrapy;
 ---> Running in 3eb8b7bfe878
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.

Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed

UnsatisfiableError: The following specifications were found to be incompatible with the existing python installation in your environment:

Specifications:

  - scrapy -> python[version='2.7.*|3.5.*|3.6.*|>=2.7,<2.8.0a0|>=3.6,<3.7.0a0|>=3.7,<3.8.0a0|>=3.8,<3.9.0a0|>=3.5,<3.6.0a0|3.4.*']

Your python: python=3.9

If python is on the left-most side of the chain, that's the version you've asked for.
When python appears to the right, that indicates that the thing on the left is somehow
not available for the python version you are constrained to. Note that conda will not
change your python version to a different minor version unless you explicitly specify
that.

Not sure if I’m reading this correctly, but it seems as if Scrapy does not support Python 3.9, except that here I see “Scrapy requires Python 3.6+”: https://docs.scrapy.org/en/latest/intro/install.html. Do you know what’s causing this issue? I also checked here, but there is no answer yet either.
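One reading of the solver output above: at that time the conda scrapy package had builds only up to Python 3.8 (`scrapy -> python[... <3.9.0a0]`), while the latest Miniconda installer ships Python 3.9. A hedged workaround, assuming Python 3.8 is acceptable for your spiders, is to pin the interpreter version so the solver can resolve:

```shell
# Assumption: pin Python 3.8 so conda can satisfy scrapy's python constraint.
# Downgrading the base environment is invasive; a dedicated environment is safer:
conda create -y -n scrapyenv -c conda-forge python=3.8 scrapy

# Activate it before running any scrapy commands:
conda activate scrapyenv
```

Inside a Dockerfile, the same effect can be had by downloading a Python 3.8 Miniconda installer (as the answer below does with Miniconda3-py38_4.10.3) instead of the "latest" build.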


Answer

To run a containerised app, it must be installed in a container image first – you don’t want to install any software on the host machine.

For Linux there are off-the-shelf container images for almost everything, which is probably what your Docker Desktop environment was using; I see 1051 results in a Docker Hub search for scrapy, but none of them are Windows containers.

The full process of creating a Windows container from scratch for an app is:

  • Get steps to manually install the app (scrapy and its dependencies) on Windows Server – ideally test in a virtualised environment so you can reset it cleanly
  • Convert all steps to a fully automatic PowerShell script (e.g. for conda, you need to download the installer via wget, execute the installer, etc.)
  • Optionally, test the PowerShell steps in an interactive container
    • docker run -it --isolation=process mcr.microsoft.com/windows/servercore:ltsc2019 powershell
    • This runs a windows container and gives you a shell to verify that your install script works
    • When you exit the shell the container is stopped
  • Create a Dockerfile
    • Use mcr.microsoft.com/windows/servercore:ltsc2019 as the base image via FROM
    • Use the RUN command for each line of your powershell script

I tried installing scrapy via an existing Windows Dockerfile that used conda / Python 3.6; it threw the error SettingsFrame has no attribute 'ENABLE_CONNECT_PROTOCOL' at a similar stage.

However, I tried again with Miniconda and Python 3.8 and was able to get scrapy running. Here’s the Dockerfile:

FROM mcr.microsoft.com/windows/servercore:ltsc2019

SHELL ["powershell", "-Command", "$ErrorActionPreference = 'Stop'; $ProgressPreference = 'SilentlyContinue';"]

RUN setx /M PATH $('C:\Users\ContainerAdministrator\miniconda3\Library\bin;C:\Users\ContainerAdministrator\miniconda3\Scripts;C:\Users\ContainerAdministrator\miniconda3;' + $Env:PATH)
RUN Invoke-WebRequest "https://repo.anaconda.com/miniconda/Miniconda3-py38_4.10.3-Windows-x86_64.exe" -OutFile miniconda3.exe -UseBasicParsing; \
    Start-Process -FilePath 'miniconda3.exe' -Wait -ArgumentList '/S', '/D=C:\Users\ContainerAdministrator\miniconda3'; \
    Remove-Item .\miniconda3.exe; \
    conda install -y -c conda-forge scrapy;

Build it with docker build . -t scrapy and run with docker run -it scrapy.

To verify you are running a shell inside the container, run whoami – it should return user manager\containeradministrator.

Then, the scrapy bench command should be able to run and dump some stats (the command is bench, not benchmark). The container will stop when you close the shell.
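The question in UPDATE 2 (the project lives on the host’s D: drive, which is not visible in the container) is not covered above. One possible approach, sketched here as an assumption rather than a tested recipe, is to bind-mount the project folder when starting the container, then run the crawl from inside the project directory:

```shell
# Assumption: bind-mount the host folder D:\scrapy into the container as C:\scrapy,
# then start an interactive PowerShell in the scrapy image built above.
docker run -it --isolation=process -v D:\scrapy:C:\scrapy scrapy powershell

# Inside the container shell: note that scrapy crawl takes the spider's *name*
# (the name attribute set in the spider class), not a filesystem path.
cd C:\scrapy\my1stscraper
scrapy crawl my1stscraper -o allobjects.json
```

For multiple scrapers, additional crawls could be started in the same running container with docker exec rather than by launching a new container per spider.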

User contributions licensed under: CC BY-SA