
Read avro files in pyspark with PyCharm

I’m quite new to Spark. I’ve imported the pyspark library into my PyCharm venv and wrote the code below:

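The original snippet was not preserved, but a minimal sketch of code that triggers this error — assuming a plain local SparkSession and a placeholder file path — looks like:

```python
from pyspark.sql import SparkSession

# Plain local session with no external packages configured.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("avro-demo") \
    .getOrCreate()

# Without the external spark-avro module on the classpath,
# this line raises the AnalysisException quoted below.
df = spark.read.format("avro").load("users.avro")  # "users.avro" is a placeholder path
df.show()
```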

Everything seems to be okay, but when I try to read an Avro file I get this message:

pyspark.sql.utils.AnalysisException: 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'

When I go to this page: https://spark.apache.org/docs/latest/sql-data-sources-avro.html, there is something like this:

[screenshot from the Avro data source guide]

and I have no idea how to implement this. Do I need to download something in PyCharm, or do I have to find external files to modify?

Thank you for your help!

Update (2019-12-06): Because I’m using Anaconda, I opened the Anaconda prompt and copied this code:


It downloaded some modules; then I went back to PyCharm and the same error appeared.
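For context, the deployment section of the Avro guide launches Spark with the external package. Assuming Spark 2.4.4 built against Scala 2.11, the command is along these lines:

```shell
# Pulls the spark-avro jar and its dependencies at launch time.
pyspark --packages org.apache.spark:spark-avro_2.11:2.4.4
```

Note that `--packages` only affects that launched session; it does not fix a PyCharm run configuration.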


Answer

I downloaded the pyspark 2.4.4 package from conda in PyCharm, added the spark-avro_2.11-2.4.4.jar file in the Spark configuration, and was able to successfully recreate your error, i.e. pyspark.sql.utils.AnalysisException: 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'

To fix this issue, follow the steps below:

  1. Uninstall the pyspark package downloaded from conda.
  2. Download and unzip spark-2.4.4-bin-hadoop2.7.tgz from here.
  3. In Run > Environment Variables, set SPARK_HOME to <download_path>/spark-2.4.4-bin-hadoop2.7 and set PYTHONPATH to $SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python
  4. Download the spark-avro_2.11-2.4.4.jar file from here.
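The variables from step 3 can also be set per shell session instead of per run configuration; a sketch, keeping the guide's `<download_path>` placeholder:

```shell
# Session-scoped equivalents of step 3 (adjust <download_path>).
export SPARK_HOME=<download_path>/spark-2.4.4-bin-hadoop2.7
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python
```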

Now you should be able to run pyspark code from PyCharm. Try the code below:

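The answer's snippet was not preserved; a minimal sketch that points the session at the downloaded jar and reads an Avro file — the jar location and data path below are placeholders — might be:

```python
from pyspark.sql import SparkSession

# Register the downloaded spark-avro jar with the session
# (adjust the path to wherever you saved the jar in step 4).
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("avro-demo") \
    .config("spark.jars", "/path/to/spark-avro_2.11-2.4.4.jar") \
    .getOrCreate()

# With the jar on the classpath, the "avro" format resolves.
df = spark.read.format("avro").load("/path/to/data.avro")  # placeholder path
df.show()
```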

The above code will work, but PyCharm will complain about the pyspark modules. To remove that warning and enable the code completion feature, follow these additional steps:

  1. In Project Structure, click Add Content Root and add spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip

Now your project structure should look like:

[screenshot of the project structure]

Output: [screenshot]

User contributions licensed under: CC BY-SA