
Tag: apache-spark

Need help running spark-submit in Apache Airflow

I am a relatively new user of Python and Airflow and am having a very difficult time getting spark-submit to run in an Airflow task. My goal is to get the following DAG task to run successfully. I know the problem lies with Airflow and not with the bash command, because when I run spark-submit --class CLASSPATH.CustomCreate ~/IdeaProjects/custom-create-job/build/libs/custom-create.jar in …
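One common way to wrap such a call in a DAG is Airflow's SparkSubmitOperator. The sketch below is only illustrative: it assumes an Airflow 1.x-style contrib import, a spark_default connection pointing at your cluster, and reuses the class and jar path from the question (with ~ expanded to an absolute path, since Airflow will not expand it for you).

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator  # Airflow 1.x import path

# Minimal DAG wrapping the spark-submit call from the question.
dag = DAG(
    dag_id="custom_create_job",
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
)

submit_task = SparkSubmitOperator(
    task_id="run_custom_create",
    application="/home/airflow/IdeaProjects/custom-create-job/build/libs/custom-create.jar",  # absolute path, no ~
    java_class="CLASSPATH.CustomCreate",
    conn_id="spark_default",  # assumed Airflow connection for the Spark master
    dag=dag,
)
```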

How to create sparkmagic session automatically (without having to manually interact with widget user-interface)?

I am using sparkmagic to connect Jupyter notebooks to a remote Spark cluster via Livy. As things stand, I need to execute a notebook cell to bring up the %manage_spark user-interface widget, then manually select the language and click “create-session” in order to establish the Spark context for the notebook. Is there a way to automatically generate the …
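One workaround is to bypass the widget entirely and create the session against Livy's REST API. The sketch below assumes an illustrative Livy endpoint and uses the documented /sessions API; it creates a plain Livy session rather than one managed by the %manage_spark widget.

```python
import json
import time

import requests

LIVY_URL = "http://livy-host:8998"  # illustrative; point at your Livy server

# Ask Livy for a new PySpark session (the same request the widget issues).
resp = requests.post(
    f"{LIVY_URL}/sessions",
    data=json.dumps({"kind": "pyspark"}),
    headers={"Content-Type": "application/json"},
)
session = resp.json()
session_id = session["id"]

# Poll until the session leaves the "starting" state.
while session["state"] not in ("idle", "error", "dead"):
    time.sleep(5)
    session = requests.get(f"{LIVY_URL}/sessions/{session_id}").json()

print("Livy session", session_id, "is", session["state"])
```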

Median and quantile values in PySpark

In my dataframe I have an age column. The total number of rows is approximately 77 billion. I want to calculate the quantile values of that column using PySpark. I have some code, but the computation time is huge (perhaps my approach is poor). Is there any good way to improve this? Dataframe example: What I have done so …
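For a column of this size, DataFrame.approxQuantile is usually far cheaper than an exact computation, since it trades a bounded relative error for a single pass over the data. The sketch below assumes a DataFrame with an age column read from a hypothetical Parquet source and uses an illustrative relative error of 0.01.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quantiles").getOrCreate()
df = spark.read.parquet("ages.parquet")  # hypothetical source with an "age" column

# approxQuantile(column, probabilities, relativeError):
# relativeError=0.0 forces an exact (much more expensive) computation,
# so accept a small error to keep the job fast on ~77 billion rows.
q25, median, q75 = df.approxQuantile("age", [0.25, 0.5, 0.75], 0.01)
print(q25, median, q75)
```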

PySpark 2.4 – Read CSV file with custom line separator

Support for custom line separators (for various text file formats) was added to Spark in 2017 (see: https://github.com/apache/spark/pull/18581). … or maybe it wasn’t added in 2017, or ever (see: https://github.com/apache/spark/pull/18304). Today, with PySpark 2.4.0, I am unable to use custom line separators to parse CSV files. Here’s some code: Here are two sample CSV files: one.csv: lines are separated …
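For reference, the CSV reader's lineSep option only landed in later releases (Spark 3.x), which would explain why it has no effect on 2.4. The sketch below assumes Spark 3.x is available and uses an illustrative | record separator; it will not work on PySpark 2.4.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("custom-linesep").getOrCreate()

# Spark 3.x CSV reader: lineSep takes a single-character record delimiter.
df = (
    spark.read
    .option("lineSep", "|")     # illustrative custom record separator
    .option("header", "false")
    .csv("one.csv")
)
df.show()
```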

PySpark 2.x: Programmatically adding Maven JAR Coordinates to Spark

The following is my PySpark startup snippet, which is pretty reliable (I’ve been using it for a long time). Today I added the two Maven coordinates shown in the spark.jars.packages option (effectively “plugging in” Kafka support). That normally triggers dependency downloads (performed by Spark automatically). However, the packages aren’t downloading and/or loading when I run the snippet (e.g. ./python -i …
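Because spark.jars.packages is only consulted while the JVM is being launched, a common pattern is to put the coordinates into PYSPARK_SUBMIT_ARGS before the session is created. The sketch below uses an illustrative Kafka coordinate, not necessarily the two from the original snippet.

```python
import os

# Must be set before the SparkSession (and its JVM) is created, otherwise
# the packages are silently ignored. The coordinate shown is illustrative.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 pyspark-shell"
)

from pyspark.sql import SparkSession

# Spark resolves and downloads the coordinates via Ivy at startup.
spark = SparkSession.builder.appName("kafka-packages").getOrCreate()
```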

Spark Calculate Standard deviation row wise

I need to calculate the standard deviation row-wise, assuming that I already have a column with the calculated mean per row. I tried this, but I got the following error: Answer Your code is completely mixed up (in its current state it won’t even produce the exception you described in the question). sqrt should be placed outside the reduce call:
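A sketch of that fix, assuming hypothetical value columns c1..c3 plus a precomputed mean column: the squared deviations are summed inside the reduce, and sqrt is applied once to the result.

```python
from functools import reduce

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("row-stddev").getOrCreate()

df = spark.createDataFrame(
    [(1.0, 2.0, 3.0, 2.0), (4.0, 4.0, 10.0, 6.0)],
    ["c1", "c2", "c3", "mean"],  # hypothetical columns; "mean" is precomputed per row
)

value_cols = ["c1", "c2", "c3"]
n = len(value_cols)

# Sum the squared deviations with reduce, then take sqrt of the whole sum
# (not inside the reduce), using the sample denominator n - 1.
sum_sq = reduce(lambda a, b: a + b, [(F.col(c) - F.col("mean")) ** 2 for c in value_cols])
df = df.withColumn("row_stddev", F.sqrt(sum_sq / (n - 1)))
df.show()
```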

No FileSystem for scheme: s3 with pyspark

I’m trying to read a txt file from S3 with Spark, but I’m getting this error: This is my code: This is the full traceback: How can I fix this? Answer If you are using a local machine you can use boto3 (do not forget to set up your AWS S3 credentials). Another clean solution if you are using an AWS …
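A minimal sketch of the boto3 route mentioned in the answer, assuming AWS credentials are already configured and using an illustrative bucket and key:

```python
import boto3

# Assumes credentials are available via ~/.aws/credentials, environment
# variables, or an IAM role attached to the machine.
s3 = boto3.client("s3")

# Illustrative bucket/key; replace with the object you actually want to read.
obj = s3.get_object(Bucket="my-bucket", Key="data/input.txt")
text = obj["Body"].read().decode("utf-8")

for line in text.splitlines()[:5]:
    print(line)
```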
