I’m quite new to Spark. I’ve imported the pyspark library into a PyCharm venv and wrote the code below:
# Imports
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .appName('DataFrame') \
    .master('local[*]') \
    .getOrCreate()

spark.conf.set("spark.sql.shuffle.partitions", 5)

path = "file_path"
df = spark.read.format("avro").load(path)
Everything seems to be okay, but when I try to read the avro file I get this message:
pyspark.sql.utils.AnalysisException: 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'
When I go to this page: https://spark.apache.org/docs/latest/sql-data-sources-avro.html there is something like this:

and I have no idea how to implement it. Do I need to download something in PyCharm, or do I have to find external files to modify?

Thank you for your help!
Update (2019-12-06): Because I’m using Anaconda, I opened the Anaconda prompt and ran this command:
pyspark --packages com.databricks:spark-avro_2.11:4.0.0
It downloaded some modules, but when I went back to PyCharm the same error appeared.
Answer
I downloaded the pyspark version 2.4.4 package from conda in PyCharm, added the spark-avro_2.11-2.4.4.jar file to the Spark configuration, and was able to successfully recreate your error, i.e.: pyspark.sql.utils.AnalysisException: 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'
To fix this issue, follow the steps below:

- Uninstall the pyspark package downloaded from conda.
- Download and unzip spark-2.4.4-bin-hadoop2.7.tgz from here.
- In Run > Environment Variables, set SPARK_HOME to <download_path>/spark-2.4.4-bin-hadoop2.7 and set PYTHONPATH to $SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python (a programmatic alternative is sketched after this list).
- Download the spark-avro_2.11-2.4.4.jar file from here.
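If you would rather set these variables from code than in the PyCharm run configuration, here is a minimal sketch using os.environ and sys.path; it assumes the same <download_path> as above and has to run before the SparkSession is created:

import os
import sys

# Assumption: <download_path> is the directory where spark-2.4.4-bin-hadoop2.7 was unzipped.
spark_home = "<download_path>/spark-2.4.4-bin-hadoop2.7"
os.environ["SPARK_HOME"] = spark_home

# Mirror the PYTHONPATH value from the step above so the bundled
# pyspark and py4j modules are importable in this interpreter.
sys.path.insert(0, os.path.join(spark_home, "python", "lib", "py4j-0.10.7-src.zip"))
sys.path.insert(0, os.path.join(spark_home, "python"))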
Now you should be able to run pyspark code from PyCharm. Try the code below:
# Imports
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext

# Create SparkSession
spark = SparkSession.builder \
    .appName('DataFrame') \
    .master('local[*]') \
    .config('spark.jars', '<path>/spark-avro_2.11-2.4.4.jar') \
    .getOrCreate()

df = spark.read.format('avro').load('<path>/userdata1.avro')
df.show()
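A side note: instead of pointing spark.jars at a jar you downloaded by hand, Spark can resolve the Avro module from Maven through the spark.jars.packages option. A minimal sketch, assuming Spark 2.4.4 built against Scala 2.11 and network access on the first run:

# Resolve the external Avro module from Maven instead of a local jar.
# Assumption: Spark 2.4.4 / Scala 2.11, hence the _2.11:2.4.4 coordinate.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('DataFrame') \
    .master('local[*]') \
    .config('spark.jars.packages', 'org.apache.spark:spark-avro_2.11:2.4.4') \
    .getOrCreate()

df = spark.read.format('avro').load('<path>/userdata1.avro')

This org.apache.spark coordinate is the one the deployment section of the Avro guide points to; it replaces the older com.databricks:spark-avro package tried in the update.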
Either way, the code will run, but PyCharm will complain about the pyspark modules. To remove that warning and enable code completion, follow these additional steps:
- In Project Structure, click on Add Content Root and add spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip
Now your project structure should look like: