I currently have an AWS EMR cluster with a notebook attached to it.
I would like to load a spaCy model (en_core_web_sm), but first I need to download the model, which is normally done with python -m spacy download en_core_web_sm. However, I can't figure out how to do that from a PySpark session.
Here is my config:
%%configure -f
{
  "name": "conf0",
  "kind": "pyspark",
  "conf": {
    "spark.pyspark.python": "python",
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv"
  },
  "files": [
    "s3://my-s3/code/utils/NLPtools.py",
    "s3://my-s3/code/utils/Parse_wikidump.py",
    "s3://my-s3/code/utils/S3_access.py",
    "s3://my-s3/code/utils/myeval.py",
    "s3://my-s3/code/utils/rank_metrics.py",
    "s3://my-s3/code/utils/removeoutput.py",
    "s3://my-s3/code/utils/secret_manager.py",
    "s3://my-s3/code/utils/word2vec.py"
  ]
}
I'm able to run commands like the following, which is expected:
sc.install_pypi_package("boto3")
sc.install_pypi_package("pandas")
sc.install_pypi_package("hdfs")
sc.install_pypi_package("NLPtools")
sc.install_pypi_package("numpy")
sc.install_pypi_package("tqdm")
sc.install_pypi_package("wikipedia")
sc.install_pypi_package("filechunkio")
sc.install_pypi_package("thinc")
sc.install_pypi_package("gensim")
sc.install_pypi_package("termcolor")
sc.install_pypi_package("boto")
sc.install_pypi_package("spacy")
sc.install_pypi_package("langdetect")
sc.install_pypi_package("pathos")
But since I can't download the model, trying to load it gives the following error:
An error was encountered:
[E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
Traceback (most recent call last):
  File "/mnt/tmp/spark-eef27750-07a4-4a8a-82dc-b006827e7f1f/userFiles-ec6ecbe3-558b-42df-bd38-cd33b2340ae0/NLPtools.py", line 13, in <module>
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'textcat'])
  File "/tmp/1596550154785-0/lib/python2.7/site-packages/spacy/__init__.py", line 30, in load
    return util.load_model(name, **overrides)
  File "/tmp/1596550154785-0/lib/python2.7/site-packages/spacy/util.py", line 175, in load_model
    raise IOError(Errors.E050.format(name=name))
IOError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
I've tried installing it directly on the cluster (master/workers), but that happens "outside" the PySpark session, so it isn't picked up. And commands like !python -m spacy download en_core_web_sm
don't work in a PySpark notebook…
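For what it's worth, the model itself is just an ordinary pip package published on the explosion/spacy-models GitHub releases page, so in theory it could be installed like any other dependency. I'm not sure whether sc.install_pypi_package accepts a direct tarball URL, and the version below is only an example that would have to match the installed spaCy, but something along these lines might work:

# Untested idea: point install_pypi_package at the model's release tarball.
# The URL/version are placeholders and must match the spaCy version in use.
sc.install_pypi_package(
    "https://github.com/explosion/spacy-models/releases/download/"
    "en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz"
)

import spacy
nlp = spacy.load("en_core_web_sm", disable=["parser", "textcat"])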
Thanks in advance!
Answer
The best way to install spaCy and its models is to use an EMR bootstrap script. This one works for me.
My configuration:
Release label: emr-5.32.0
Hadoop distribution: Amazon 2.10.1
Applications: Spark 2.4.7, JupyterEnterpriseGateway 2.1.0, Livy 0.7.0
My script:
#!/bin/bash -xe

#### WARNING #####
## After modifying this script you have to push it to S3.

# Non-standard and non-Amazon Machine Image Python modules:
version=1.1
printf "This is the latest script $version"

sudo /usr/bin/pip3.7 install -U boto3 pandas langdetect hdfs tqdm pathos wikipedia filechunkio gensim termcolor awswrangler

# Install spacy. Order matters!
sudo /usr/bin/pip3.7 install -U numpy Cython pip
sudo /usr/local/bin/pip3.7 install -U spacy
sudo python3 -m spacy download en_core_web_sm
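For completeness, here is a minimal sketch of how such a script can be attached as a bootstrap action when the cluster is created, using boto3 (the S3 path, instance types and IAM roles are placeholders; adapt them to your setup):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spacy-cluster",
    ReleaseLabel="emr-5.32.0",
    Applications=[{"Name": "Spark"}, {"Name": "Livy"}, {"Name": "JupyterEnterpriseGateway"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    # The bootstrap action runs the install script on every node before the applications start.
    BootstrapActions=[
        {
            "Name": "install-python-libs",
            "ScriptBootstrapAction": {"Path": "s3://my-s3/bootstrap/install_libs.sh"},
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])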
Two important points to notice about the script:
- Use sudo for all commands
- Upgrade pip, and use the upgraded pip's path (/usr/local/bin) for the installs that follow
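Once the cluster comes up with this bootstrap action, the model is installed as a regular package on every node. A quick sanity check in the notebook could look like this (a sketch; sc is the SparkContext that the PySpark kernel already provides):

import spacy

# Driver-side check: the model now resolves like any installed package.
nlp = spacy.load("en_core_web_sm", disable=["parser", "textcat"])
print(nlp("Amazon EMR runs Apache Spark.").ents)

# Executor-side check: the bootstrap script ran on the workers too,
# so loading the model inside a task should also succeed.
def count_tokens(text):
    import spacy
    nlp = spacy.load("en_core_web_sm", disable=["parser", "textcat"])
    return len(nlp(text))

print(sc.parallelize(["Bootstrap actions run on every node."]).map(count_tokens).collect())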