I’m quite new to Spark. I’ve imported the pyspark library into a PyCharm venv and wrote the code below:
# Imports
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .appName('DataFrame') \
    .master('local[*]') \
    .getOrCreate()

spark.conf.set("spark.sql.shuffle.partitions", 5)

path = "file_path"
df = spark.read.format("avro").load(path)
Everything seems to be okay, but when I try to read the avro file I get this message:
pyspark.sql.utils.AnalysisException: 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'
When I go to this page: https://spark.apache.org/docs/latest/sql-data-sources-avro.html there is something like this:

and I have no idea how to implement it. Do I need to download something in PyCharm, or do I have to find external files to modify?

Thank you for your help!
Update (2019-12-06): Because I’m using Anaconda, I opened the Anaconda prompt and ran this command:
pyspark --packages com.databricks:spark-avro_2.11:4.0.0
It downloaded some modules, but when I went back to PyCharm the same error appeared.
Answer
I downloaded the pyspark version 2.4.4 package from conda in PyCharm, added the spark-avro_2.11-2.4.4.jar file to the Spark configuration, and was able to successfully recreate your error, i.e.: pyspark.sql.utils.AnalysisException: 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'
To fix this issue, follow the steps below:

- Uninstall the pyspark package downloaded from conda.
- Download and unzip spark-2.4.4-bin-hadoop2.7.tgz from here.
- In Run > Environment Variables, set SPARK_HOME to <download_path>/spark-2.4.4-bin-hadoop2.7 and set PYTHONPATH to $SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python (a programmatic alternative is sketched after this list).
- Download the spark-avro_2.11-2.4.4.jar file from here.
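If you would rather set these variables from code than in the PyCharm run configuration, here is a minimal sketch using os.environ and sys.path; it assumes the same <download_path> as above and has to run before the SparkSession is created:

import os
import sys

# Assumption: <download_path> is the directory where spark-2.4.4-bin-hadoop2.7 was unzipped.
spark_home = "<download_path>/spark-2.4.4-bin-hadoop2.7"
os.environ["SPARK_HOME"] = spark_home

# Mirror the PYTHONPATH value from the step above so the bundled
# pyspark and py4j modules are importable in this interpreter.
sys.path.insert(0, os.path.join(spark_home, "python", "lib", "py4j-0.10.7-src.zip"))
sys.path.insert(0, os.path.join(spark_home, "python"))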
Now you should be able to run pyspark code from PyCharm. Try the code below:
# Imports
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext

# Create SparkSession
spark = SparkSession.builder \
    .appName('DataFrame') \
    .master('local[*]') \
    .config('spark.jars', '<path>/spark-avro_2.11-2.4.4.jar') \
    .getOrCreate()

df = spark.read.format('avro').load('<path>/userdata1.avro')
df.show()
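A side note: instead of pointing spark.jars at a jar you downloaded by hand, Spark can resolve the Avro module from Maven through the spark.jars.packages option. A minimal sketch, assuming Spark 2.4.4 built against Scala 2.11 and network access on the first run:

# Resolve the external Avro module from Maven instead of a local jar.
# Assumption: Spark 2.4.4 / Scala 2.11, hence the _2.11:2.4.4 coordinate.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('DataFrame') \
    .master('local[*]') \
    .config('spark.jars.packages', 'org.apache.spark:spark-avro_2.11:2.4.4') \
    .getOrCreate()

df = spark.read.format('avro').load('<path>/userdata1.avro')

This org.apache.spark coordinate is the one the deployment section of the Avro guide points to; it replaces the older com.databricks:spark-avro package tried in the update.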
Either way, the code will run, but PyCharm will complain about the pyspark modules. To remove that warning and enable code completion, follow these additional steps:
- In Project Structure, click on Add Content Root and add spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip
Now your project structure should look like: