
I can’t install a spaCy model in an EMR PySpark notebook

I currently have an AWS EMR cluster with a notebook attached to it.

I would like to load a spaCy model (en_core_web_sm), but first I need to download it, which is usually done with python -m spacy download en_core_web_sm. I can’t figure out how to do that from a PySpark session.

Here is my config:

%%configure -f
{
    "name":"conf0",
    "kind": "pyspark",
    "conf":{
          "spark.pyspark.python": "python",
          "spark.pyspark.virtualenv.enabled": "true",
          "spark.pyspark.virtualenv.type":"native",
          "spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"
    },
    "files":["s3://my-s3/code/utils/NLPtools.py",
            "s3://my-s3/code/utils/Parse_wikidump.py",
            "s3://my-s3/code/utils/S3_access.py",
            "s3://my-s3/code/utils/myeval.py",
            "s3://my-s3/code/utils/rank_metrics.py",
            "s3://my-s3/code/utils/removeoutput.py",
            "s3://my-s3/code/utils/secret_manager.py",
            "s3://my-s3/code/utils/word2vec.py"]
}

I’m able to run commands like the following, which is expected:

sc.install_pypi_package("boto3")
sc.install_pypi_package("pandas")
sc.install_pypi_package("hdfs")
sc.install_pypi_package("NLPtools")
sc.install_pypi_package("numpy")
sc.install_pypi_package("tqdm")
sc.install_pypi_package("wikipedia")
sc.install_pypi_package("filechunkio")
sc.install_pypi_package("thinc")
sc.install_pypi_package("gensim")
sc.install_pypi_package("termcolor")
sc.install_pypi_package("boto")
sc.install_pypi_package("spacy")
sc.install_pypi_package("langdetect")
sc.install_pypi_package("pathos")

But since I can’t download the model, I get the following error when trying to load it:

An error was encountered:
[E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
Traceback (most recent call last):
  File "/mnt/tmp/spark-eef27750-07a4-4a8a-82dc-b006827e7f1f/userFiles-ec6ecbe3-558b-42df-bd38-cd33b2340ae0/NLPtools.py", line 13, in <module>
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'textcat'])
  File "/tmp/1596550154785-0/lib/python2.7/site-packages/spacy/__init__.py", line 30, in load
    return util.load_model(name, **overrides)
  File "/tmp/1596550154785-0/lib/python2.7/site-packages/spacy/util.py", line 175, in load_model
    raise IOError(Errors.E050.format(name=name))
IOError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

I’ve tried installing it directly on the cluster (master and workers), but that happens “outside” the PySpark session, so it isn’t picked up. And commands like !python -m spacy download en_core_web_sm don’t work in a PySpark notebook…

Thanks in advance!


Answer

The best way to install spaCy and its models is to use an EMR bootstrap script. This one works for me.
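For context, a bootstrap script runs on every node when the cluster is created, so it has to be attached at launch time. Below is a minimal boto3 sketch of that step, not the exact setup from this answer: the region, instance types, roles, and the S3 path s3://my-s3/bootstrap/install_spacy.sh are placeholders you would replace with your own.

import boto3

# All names below (region, bucket path, instance types, roles) are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="nlp-cluster",
    ReleaseLabel="emr-5.32.0",
    Applications=[
        {"Name": "Spark"},
        {"Name": "Livy"},
        {"Name": "JupyterEnterpriseGateway"},
    ],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    # The bootstrap action runs the install script on every node before Spark starts.
    BootstrapActions=[{
        "Name": "install-spacy",
        "ScriptBootstrapAction": {"Path": "s3://my-s3/bootstrap/install_spacy.sh"},
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print("Started cluster:", response["JobFlowId"])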

My configuration:

Release label: emr-5.32.0
Hadoop distribution: Amazon 2.10.1
Applications: Spark 2.4.7, JupyterEnterpriseGateway 2.1.0, Livy 0.7.0

My script:

#!/bin/bash -xe

#### WARNING #####
## After modifying this script, you have to push it to S3 again

# Non-standard and non-Amazon Machine Image Python modules:
version=1.1

printf "This is the latest script $version"

sudo /usr/bin/pip3.7 install -U \
  boto3 \
  pandas \
  langdetect \
  hdfs \
  tqdm \
  pathos \
  wikipedia \
  filechunkio \
  gensim \
  termcolor \
  awswrangler

# Install spacy. Order matters!
sudo /usr/bin/pip3.7 install -U \
  numpy \
  Cython \
  pip

# pip itself was upgraded above, so the new pip3.7 now lives under /usr/local/bin
sudo /usr/local/bin/pip3.7 install -U spacy

sudo python3 -m spacy download en_core_web_sm

Two important points to notice:

  • Use sudo for all commands
  • Upgrade pip, then use the new pip path (/usr/local/bin) for the commands after it
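
Once the cluster comes up with this bootstrap action, a quick sanity check you can run in a notebook cell looks like the following (a minimal sketch; the sample sentence is just an illustration):

import spacy

# The model was installed system-wide by the bootstrap script, so it resolves by name.
nlp = spacy.load("en_core_web_sm", disable=["parser", "textcat"])
doc = nlp("Amazon EMR runs Apache Spark on EC2 instances.")
print([(ent.text, ent.label_) for ent in doc.ents])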