
I can’t install a spaCy model in an EMR PySpark notebook

I currently have an AWS EMR cluster with a notebook attached to it.

I would like to load a spaCy model (en_core_web_sm), but first I need to download it, which is usually done with python -m spacy download en_core_web_sm — and I can’t find a way to do that in a PySpark session.

Here is my config:

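The config cell itself is not reproduced above; a typical session configuration for notebook-scoped libraries on an EMR PySpark notebook looks roughly like this (a sketch — the exact values are assumptions on my part, following the usual virtualenv setup):

```
%%configure -f
{
    "conf": {
        "spark.pyspark.python": "python3",
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type": "native",
        "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv"
    }
}
```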

I’m able to run the following command, which seems normal:

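The snippet itself is not reproduced above; the in-session install command that does work on EMR notebooks is sc.install_pypi_package, so assuming that is what is meant here, a sketch (the helper name is mine, not from the original):

```python
def install_session_packages(sc):
    """Install notebook-scoped packages into an EMR PySpark session.

    `sc` is the SparkContext the EMR notebook provides; the helper name
    is hypothetical, but install_pypi_package is the EMR Notebooks API
    for notebook-scoped libraries.
    """
    # This succeeds: spacy itself is a regular PyPI package.
    sc.install_pypi_package("spacy")
    # There is, however, no session-scoped equivalent of
    # `python -m spacy download en_core_web_sm`, which is why the
    # model itself cannot be fetched the same way.
```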

But of course, since I can’t manage to download the model, I get the following error when trying to load it:


I’ve tried installing it directly on the cluster (master and workers), but that happens “outside” the PySpark session, so it isn’t picked up. And commands like !python -m spacy download en_core_web_sm don’t work in a PySpark notebook…
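Whichever install route is tried, a quick way to check from inside the session whether the model package actually reached the Python environment — without needing spaCy at all — is a stdlib lookup (the module names below are just for illustration):

```python
from importlib.util import find_spec

def model_is_installed(package_name):
    """Return True if `package_name` is importable in this interpreter.

    spaCy models such as en_core_web_sm install as ordinary Python
    packages, so if find_spec() can locate one, spacy.load() should
    be able to find it too.
    """
    return find_spec(package_name) is not None

# A stdlib module is always found...
print(model_is_installed("json"))
# ...while a package that was never installed is not.
print(model_is_installed("no_such_model_xyz"))
```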

Thanks in advance!


Answer

The best way to install spaCy and its models is to use an EMR bootstrap script. This one works for me.

My configuration:


My script:

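The script itself is not reproduced above; a bootstrap script consistent with the two points noted here (package locations and the model name are assumptions on my part) would look roughly like:

```shell
#!/bin/bash
# EMR bootstrap action: runs on every node before applications start.
# Every command needs sudo, since bootstrap actions don't run as root.

# Upgrade pip first...
sudo python3 -m pip install --upgrade pip

# ...then adjust PATH so the upgraded pip (under /usr/local/bin)
# is the one used for the rest of the script.
export PATH=/usr/local/bin:$PATH

# Install spacy and the model system-wide, so both the driver
# and the executors can load it.
sudo /usr/local/bin/pip3 install spacy
sudo /usr/local/bin/python3 -m spacy download en_core_web_sm
```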

Two important points to notice:

  • Use sudo for all commands
  • Upgrade pip first, and adjust the PATH afterwards so the upgraded pip is the one that gets used
User contributions licensed under: CC BY-SA