I am using JupyterLab to run Spark NLP text analysis. At the moment I am just running the sample code:
import sparknlp
from pyspark.sql import SparkSession
from sparknlp.pretrained import PretrainedPipeline

# create or get Spark Session
# spark = sparknlp.start()
spark = SparkSession.builder \
    .appName("ner") \
    .master("local[4]") \
    .config("spark.driver.memory", "8G") \
    .config("spark.driver.maxResultSize", "2G") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.6.5") \
    .config("spark.kryoserializer.buffer.max", "500m") \
    .getOrCreate()

print("sparknlp version", sparknlp.version(), "sparkversion", spark.version)

# download, load, and annotate a text by pre-trained pipeline
pipeline = PretrainedPipeline('recognize_entities_dl', 'en')
result = pipeline.annotate('Harry Potter is a great movie')
I get the following error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-bfd6884be04c> in <module>
     15
     16 #download, load, and annotate a text by pre-trained pipeline
---> 17 pipeline = PretrainedPipeline('recognize_entities_dl', 'en')
     18 result = pipeline.annotate('Harry Potter is a great movie')

~/.pyenv/versions/3.7.9/lib/python3.7/site-packages/sparknlp/pretrained.py in __init__(self, name, lang, remote_loc, parse_embeddings, disk_location)
     89     def __init__(self, name, lang='en', remote_loc=None, parse_embeddings=False, disk_location=None):
     90         if not disk_location:
---> 91             self.model = ResourceDownloader().downloadPipeline(name, lang, remote_loc)
     92         else:
     93             self.model = PipelineModel.load(disk_location)

~/.pyenv/versions/3.7.9/lib/python3.7/site-packages/sparknlp/pretrained.py in downloadPipeline(name, language, remote_loc)
     49     def downloadPipeline(name, language, remote_loc=None):
     50         print(name + " download started this may take some time.")
---> 51         file_size = _internal._GetResourceSize(name, language, remote_loc).apply()
     52         if file_size == "-1":
     53             print("Can not find the model to download please check the name!")

~/.pyenv/versions/3.7.9/lib/python3.7/site-packages/sparknlp/internal.py in __init__(self, name, language, remote_loc)
    190     def __init__(self, name, language, remote_loc):
    191         super(_GetResourceSize, self).__init__(
--> 192             "com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize", name, language, remote_loc)
    193
    194

~/.pyenv/versions/3.7.9/lib/python3.7/site-packages/sparknlp/internal.py in __init__(self, java_obj, *args)
    127         super(ExtendedJavaWrapper, self).__init__(java_obj)
    128         self.sc = SparkContext._active_spark_context
--> 129         self._java_obj = self.new_java_obj(java_obj, *args)
    130         self.java_obj = self._java_obj
    131

~/.pyenv/versions/3.7.9/lib/python3.7/site-packages/sparknlp/internal.py in new_java_obj(self, java_class, *args)
    137
    138     def new_java_obj(self, java_class, *args):
--> 139         return self._new_java_obj(java_class, *args)
    140
    141     def new_java_array(self, pylist, java_class):

~/.pyenv/versions/3.7.9/lib/python3.7/site-packages/pyspark/ml/wrapper.py in _new_java_obj(java_class, *args)
     67             java_obj = getattr(java_obj, name)
     68         java_args = [_py2java(sc, arg) for arg in args]
---> 69         return java_obj(*java_args)
     70
     71     @staticmethod

TypeError: 'JavaPackage' object is not callable
I read a few of the GitHub issues developers raised in the spark-nlp repo, but the fixes are not working for me. I am wondering whether the use of pyenv is causing problems, but it works for everything else.
My JupyterLab is launched like so:
/home/myuser/.pyenv/shims/jupyter lab --no-browser --allow-root --notebook-dir /home/myuser/workdir/notebooks
My env configuration:
Ubuntu: 20.10
Apache Spark: 3.0.1
pyspark: 2.4.4
spark-nlp: 2.6.5
pyenv: 1.2.21
Java:
openjdk 11.0.9 2020-10-20
OpenJDK Runtime Environment (build 11.0.9+10-post-Ubuntu-0ubuntu1)
OpenJDK 64-Bit Server VM (build 11.0.9+10-post-Ubuntu-0ubuntu1, mixed mode, sharing)
jupyter:
jupyter core     : 4.7.0
jupyter-notebook : 6.1.5
qtconsole        : 5.0.1
ipython          : 7.19.0
ipykernel        : 5.4.2
jupyter client   : 6.1.7
jupyter lab      : 2.2.9
nbconvert        : 6.0.7
ipywidgets       : 7.5.1
nbformat         : 5.0.8
traitlets        : 5.0.5
I appreciate your help. Thank you.
Answer
Remove Spark 3.0.1 and keep just PySpark 2.4.x, as Spark NLP 2.6.x does not yet support Spark 3.x. Also use Java 8 instead of Java 11, because Java 11 is not supported by Spark 2.4. The TypeError: 'JavaPackage' object is not callable is the usual symptom of such a mismatch: the spark-nlp_2.11 jar is built for the Scala 2.11 runtime of Spark 2.4, so under Spark 3.x (Scala 2.12) it never loads, and the Python wrapper finds no JVM class to call.
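A minimal sketch of the fix, assuming a pyenv setup like the one above (the apt package name and the JAVA_HOME path are assumptions; verify them on your machine):

# Java 8 runtime (Ubuntu package name assumed; verify with apt search)
sudo apt-get install openjdk-8-jdk
# typical Ubuntu install path for OpenJDK 8 -- confirm locally
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# unset SPARK_HOME so pyspark 2.4.x uses its own bundled Spark, not the 3.0.1 install
unset SPARK_HOME
# matching Python packages inside the pyenv environment
pip install pyspark==2.4.4 spark-nlp==2.6.5
# relaunch JupyterLab from this shell so the kernel inherits JAVA_HOME
/home/myuser/.pyenv/shims/jupyter lab --no-browser --allow-root --notebook-dir /home/myuser/workdir/notebooks

With that environment in place, the sample code from the question should run unchanged. A quick sanity check in a fresh notebook cell:

import sparknlp

# start() creates a SparkSession with the matching spark-nlp jar on the classpath
spark = sparknlp.start()
print(sparknlp.version(), spark.version)  # expect 2.6.5 and 2.4.x

Using sparknlp.start() is also less error-prone than building the SparkSession by hand, since it pins the spark-nlp jar that matches the installed Python package.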