I am a relatively new user to Python and Airflow and am having a very difficult time getting spark-submit to run in an Airflow task. My goal is to get the following DAG task to run successfully. I know the problem lies with Airflow and not with the bash, because when I run the command spark-submit --class CLASSPATH.CustomCreate ~/IdeaProjects/custom-create-job/build/libs/custom-create.jar in
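Since the command already works from a shell, one way to wire it into Airflow is to hand the same command to a BashOperator. The following is only a sketch, assuming an Airflow 1.x import path and that spark-submit is on the PATH of the worker that executes the task; the DAG id, schedule, and start date are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

# hypothetical DAG wrapping the spark-submit command from the question
with DAG(dag_id="custom_create",
         start_date=datetime(2019, 1, 1),
         schedule_interval=None) as dag:
    submit_job = BashOperator(
        task_id="spark_submit_custom_create",
        # spark-submit must be resolvable on the Airflow worker running this task
        bash_command=(
            "spark-submit --class CLASSPATH.CustomCreate "
            "~/IdeaProjects/custom-create-job/build/libs/custom-create.jar"
        ),
    )
```

If the Spark provider/contrib package is installed, SparkSubmitOperator is the more structured alternative, but the BashOperator route has the advantage of running exactly the command that already works interactively.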
Tag: apache-spark
How to create a sparkmagic session automatically (without having to manually interact with the widget user interface)?
I am using sparkmagic to connect Jupyter notebooks to a remote spark cluster via Livy. The way it is now, I need to execute a notebook cell to bring up the %manage_spark user-interface widget, and manually select the language and click “create-session” in order to establish the spark context for the notebook. Is there a way to automatically generate the
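One way to avoid the widget entirely is to drive Livy's REST API yourself and establish the session before any sparkmagic cell runs. This is only a sketch under assumptions: the Livy endpoint (http://livy-host:8998) is hypothetical, and the session kind is taken to be pyspark:

```python
import json
import time

import requests

LIVY_URL = "http://livy-host:8998"  # hypothetical Livy endpoint
headers = {"Content-Type": "application/json"}

# ask Livy for a new PySpark session instead of clicking "create-session"
resp = requests.post(f"{LIVY_URL}/sessions",
                     data=json.dumps({"kind": "pyspark"}),
                     headers=headers)
session_url = f"{LIVY_URL}/sessions/{resp.json()['id']}"

# poll until the session leaves the "starting" state
while requests.get(session_url, headers=headers).json()["state"] != "idle":
    time.sleep(2)

# run a statement in the remote Spark context
requests.post(f"{session_url}/statements",
              data=json.dumps({"code": "spark.range(10).count()"}),
              headers=headers)
```

sparkmagic's own `%spark add` line magic (available once its IPython magics are loaded) is another non-interactive route, though the exact arguments depend on the sparkmagic version in use.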
Extract multiple words using regexp_extract in PySpark
I have a list which contains some words, and I need to extract the matching words from a text line. I found this, but it only extracts one word. keys file content: this is a keyword part_description file content: 32015 this is a keyword hello world Code Outputs Expected output I want to return all matching keywords and their count and
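regexp_extract returns at most one match per call, so a common workaround is a UDF built on re.findall over a single alternation of all keywords. A sketch, with the keyword list and column name assumed from the excerpt:

```python
import re

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

keys = ["this is a keyword"]                    # hypothetical contents of the keys file
pattern = "|".join(re.escape(k) for k in keys)  # one alternation of all keywords

find_all = F.udf(lambda s: re.findall(pattern, s or ""), ArrayType(StringType()))

df = spark.createDataFrame(
    [("32015 this is a keyword hello world",)], ["part_description"]
)

# one row per matched keyword, then a count per keyword
matches = df.withColumn("keyword", F.explode(find_all("part_description")))
matches.groupBy("keyword").count().show()
```

On Spark 3.1+ the built-in regexp_extract_all SQL function can replace the Python UDF, but it is not available in the 2.x line.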
Median and quantile values in Pyspark
In my dataframe I have an age column. The total number of rows is approximately 77 billion. I want to calculate the quantile values of that column using PySpark. I have some code, but the computation time is huge (maybe my process is very bad). Is there any good way to improve this? Dataframe example: What I have done so
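At that row count, DataFrame.approxQuantile is usually the practical tool: it trades a bounded relative error for a single pass over the data. A sketch assuming the column is called age, as in the question:

```python
# approxQuantile takes the column name, a list of probabilities, and a relative
# error; relativeError=0 forces an exact (and far more expensive) computation
q1, median, q3 = df.approxQuantile("age", [0.25, 0.5, 0.75], 0.001)
```

Loosening the relative error (e.g. 0.01 instead of 0.001) cuts the cost further if approximate quantiles are acceptable.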
PySpark: filtering with isin returns empty dataframe
Context: I need to filter a dataframe based on the contents of another dataframe’s column, using the isin function. For Python users working with pandas, that would be: isin(). For R users, that would be: %in%. So I have a simple spark dataframe with id and value columns: I want to get all ids that appear multiple times. Here’s a dataframe
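A frequent cause of the empty result is passing another DataFrame's Column, or a list of Row objects, to isin; it expects plain Python values, so they need to be collected first. A sketch, where df and ids_df are hypothetical names for the two dataframes:

```python
from pyspark.sql import functions as F

# pull the values out of the second dataframe into a plain Python list
wanted_ids = [row["id"] for row in ids_df.select("id").distinct().collect()]

filtered = df.filter(F.col("id").isin(wanted_ids))

# for large id sets, a join scales better than collecting + isin
filtered_via_join = df.join(ids_df.select("id").distinct(), on="id", how="inner")
```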
PySpark 2.4 – Read CSV file with custom line separator
Support for custom line separators (for various text file formats) was added to Spark in 2017 (see: https://github.com/apache/spark/pull/18581). … or maybe it wasn’t added in 2017 – or ever (see: https://github.com/apache/spark/pull/18304). Today, with PySpark 2.4.0, I am unable to use custom line separators to parse CSV files. Here’s some code: Here are two sample CSV files: one.csv – lines are separated
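Whatever the state of a lineSep option in 2.4, the older Hadoop-level workaround still applies: set textinputformat.record.delimiter and read through newAPIHadoopFile, then split the CSV fields yourself. A sketch assuming a "|" record delimiter and a two-column schema, both of which are placeholders:

```python
delimited = spark.sparkContext.newAPIHadoopFile(
    "one.csv",                                               # hypothetical path
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "|"},          # custom line separator
)

# drop the byte-offset keys, then split each record into CSV fields
rows = delimited.map(lambda kv: kv[1]).map(lambda line: line.split(","))
df = rows.toDF(["col1", "col2"])                             # hypothetical schema
```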
PySpark 2.x: Programmatically adding Maven JAR Coordinates to Spark
The following is my PySpark startup snippet, which is pretty reliable (I’ve been using it a long time). Today I added the two Maven coordinates shown in the spark.jars.packages option (effectively “plugging in” Kafka support). Now that normally triggers dependency downloads (performed by Spark automatically): However, the packages aren’t downloading and/or loading when I run the snippet (e.g. ./python -i
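The key constraint with spark.jars.packages is that it has to reach the JVM before the SparkContext starts; set any later, it is silently ignored. A sketch of setting it through the builder, where the Kafka coordinate and version are only an example:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Maven coordinates are resolved and downloaded at startup, so this must be
    # configured before the first SparkContext/SparkSession is created
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0",  # example coordinate only
    )
    .getOrCreate()
)
```

When launching a plain interactive interpreter (./python -i), another common route is to export PYSPARK_SUBMIT_ARGS="--packages <coordinates> pyspark-shell" in the environment before the session is built.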
Spark: Calculate standard deviation row-wise
I need to calculate the standard deviation row-wise, assuming that I already have a column with the calculated mean per row. I tried this but I got the following error Answer Your code is completely mixed up (in its current state it won’t even cause the exception you described in the question). sqrt should be placed outside the reduce call:
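To make that advice concrete: the squared deviations are summed inside the reduce, and sqrt is applied once to the finished sum. A sketch with hypothetical value columns x1..x3 and a precomputed mean column:

```python
from functools import reduce
from operator import add

from pyspark.sql import functions as F

value_cols = ["x1", "x2", "x3"]  # hypothetical columns
n = len(value_cols)

# sum the squared deviations from the row mean inside reduce ...
squared_dev = reduce(add, [(F.col(c) - F.col("mean")) ** 2 for c in value_cols])

# ... and apply sqrt once, outside the reduce call (population std dev here;
# divide by n - 1 instead for the sample version)
df = df.withColumn("row_stddev", F.sqrt(squared_dev / n))
```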
environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON
I have installed pyspark recently. It was installed correctly. When I use the following simple program in Python, I get an error. While running the last line, I get an error whose key line seems to be I have the following variables in .bashrc. I am using Python 3. Answer By the way, if you use PyCharm, you could
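The usual culprit behind this kind of error is the driver and the workers resolving different Python interpreters; pointing both variables at the same interpreter before the context starts normally clears it. A sketch done from Python itself (the equivalent exports in .bashrc work the same way):

```python
import os
import sys

# make the driver and the executors use the very same interpreter; this has to
# happen before the SparkContext is created or it has no effect
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.range(5).count())  # quick smoke test
```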
No FileSystem for scheme: s3 with pyspark
I’m trying to read a txt file from S3 with Spark, but I’m getting this error: This is my code: This is the full traceback: How can I fix this? Answer If you are using a local machine, you can use boto3: (do not forget to set up your AWS S3 credentials). Another clean solution if you are using an AWS
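The "No FileSystem for scheme" error generally means the S3 connector jar is not on Spark's classpath; pulling in hadoop-aws and switching to the s3a:// scheme is the standard fix. A sketch only: the hadoop-aws version must match the Hadoop build bundled with your Spark (2.7.3 below is an example), and the bucket and credentials are hypothetical:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # ships the S3A connector; pick the version matching your Hadoop build
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")  # hypothetical creds
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    .getOrCreate()
)

# note the s3a:// scheme rather than s3://
df = spark.read.text("s3a://my-bucket/path/file.txt")
```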