Tag: apache-spark

How to parse and transform json string from spark dataframe rows in pyspark?
I’m looking for help with how to parse a JSON string into a JSON struct (output 1) and transform a JSON string into columns a, b and id (output 2). Background: I get JSON strings via an API with a large number of rows (jstr1, jstr2, …), which are …
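The excerpt is truncated, but a minimal sketch of the usual `from_json` approach, assuming a single string column `jstr` holding JSON objects with keys `id`, `a` and `b` (the column name and types are assumptions, not taken from the question):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: one JSON string per row (column name "jstr" is assumed).
df = spark.createDataFrame([('{"id": 1, "a": "x", "b": "y"}',)], ["jstr"])

schema = StructType([
    StructField("id", LongType()),
    StructField("a", StringType()),
    StructField("b", StringType()),
])

# Output 1: parse the JSON string into a struct column.
parsed = df.withColumn("json", F.from_json("jstr", schema))

# Output 2: flatten the struct into top-level columns id, a and b.
flat = parsed.select("json.id", "json.a", "json.b")
flat.show()
```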
Spark: How to transform data from multiple nested XML files with attributes to a Data Frame
How to transform the values below from multiple XML files to a Spark data frame: attribute Id0 from Level_0, Date/Value from Level_4. Required output: file_1.xml: file_2.xml: Current Code Example: Current Output: (Id0 column with attributes missing) There are some examples, but none of them solve the problem: -I’…
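The code in the question is truncated; a sketch of one way to read nested XML with the spark-xml package, where the row tag, the nesting path between Level_0 and Level_4, and the file paths are all assumptions based only on the names mentioned above:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# spark-xml must be on the classpath, e.g.
#   spark-submit --packages com.databricks:spark-xml_2.12:0.14.0 ...
spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("xml")
      .option("rowTag", "Level_0")   # assumed: each Level_0 element becomes one row
      .load("file_*.xml"))

# XML attributes are exposed with a leading underscore by default
# (attributePrefix "_"), so the Id0 attribute of Level_0 appears as "_Id0".
out = df.select(
    F.col("_Id0").alias("Id0"),
    # the intermediate levels are hypothetical; adjust to the real nesting
    F.col("Level_1.Level_2.Level_3.Level_4.Date").alias("Date"),
    F.col("Level_1.Level_2.Level_3.Level_4.Value").alias("Value"),
)
```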
PySpark “illegal reflective access operation” when executed in terminal
I’ve installed Spark and its components locally and I’m able to execute PySpark code in Jupyter, IPython and via spark-submit; however, I receive the following WARNINGs: The .py file executes, but should I be worried about these warnings? I don’t want to start writing some code to late…
How to read a gzip compressed json lines file into PySpark dataframe?
I have a JSON-lines file that I wish to read into a PySpark data frame. The file is gzip compressed. The filename looks like this: file.jl.gz I know how to read this file into a pandas data frame: I’m new to pyspark, and I’d like to learn the pyspark equivalent of this. Is there a way to read t…
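A minimal sketch of the PySpark equivalent: `spark.read.json` expects JSON-lines input by default, and Spark decompresses `.gz` files transparently based on the extension, so no extra options should be needed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reads the gzipped JSON-lines file directly; the schema is inferred.
df = spark.read.json("file.jl.gz")
df.printSchema()
```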
spark-nlp ‘JavaPackage’ object is not callable
I am using JupyterLab to run spark-nlp text analysis. At the moment I am just running the sample code: I get the following error: I read a few of the GitHub issues developers raised in the spark-nlp repo, but the fixes are not working for me. I am wondering if the use of pyenv is causing problems, but it works…
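This error usually means the spark-nlp jar is missing on the JVM side even though the Python package is installed (or the versions do not match). A sketch of one common fix, letting `sparknlp.start()` build the session with a matching jar; the pretrained pipeline name below is just an example, not the code from the question:

```python
import sparknlp

# sparknlp.start() creates a SparkSession with the matching spark-nlp jar
# pulled in via spark.jars.packages, which avoids the
# "'JavaPackage' object is not callable" error.
spark = sparknlp.start()

from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("Spark NLP is an open-source text processing library.")
```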
How to calculate cumulative sum over date range excluding weekends in PySpark 2.0?
This is an extension to an earlier question I raised here: How to calculate difference between dates excluding weekends in PySpark 2.2.0. My Spark dataframe looks like the one below and can be generated with the accompanying code: I am trying to calculate cumulative sums over periods of 2, 3, 4, 5 & 30 days. Below i…
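The dataframe and code in the question are truncated; a sketch of one approach, assuming columns named `key`, `date` and `value`: drop weekend rows, give each remaining business day a consecutive index, then run rolling sums over that index with `rangeBetween`:

```python
from pyspark.sql import functions as F, Window

# Sketch only: "key", "date" and "value" are assumed column names,
# not the ones from the question's dataframe.
business = df.filter(~F.date_format("date", "E").isin("Sat", "Sun"))  # drop weekends

# Give each remaining business day a consecutive index per key, so that a
# window of N index values spans exactly N business days.
business = business.withColumn(
    "bday", F.dense_rank().over(Window.partitionBy("key").orderBy("date"))
)

def rolling_sum(days):
    w = Window.partitionBy("key").orderBy("bday").rangeBetween(-(days - 1), 0)
    return F.sum("value").over(w)

result = business.select(
    "key", "date", "value",
    *[rolling_sum(d).alias("sum_{}d".format(d)) for d in (2, 3, 4, 5, 30)],
)
```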
Interpolation in PySpark throws java.lang.IllegalArgumentException
I don’t know how to interpolate in PySpark when the DataFrame contains many columns. Let me explain. I need to group by webID and interpolate counts values at a 1-minute interval. However, when I apply the code shown below, Error:
Answer: Set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1. https://spa…
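A sketch of one way to apply that fix, setting `ARROW_PRE_0_15_IPC_FORMAT=1` for both the driver and the executors so that pyarrow >= 0.15 writes the legacy IPC format Spark 2.x expects; putting the variable in `conf/spark-env.sh` is the other common option:

```python
import os
from pyspark.sql import SparkSession

# Driver side.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

# Executor side, via the standard spark.executorEnv.* config prefix.
spark = (SparkSession.builder
         .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
         .getOrCreate())
```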
How to get the N most recent dates in Pyspark
Is there a way to get the 30 most recent days’ worth of records for each grouping of data in PySpark? In this example, get the 2 records with the most recent dates within the groupings of (Grouping, Bucket). So a table like this would turn into this: Edit: I reviewed my question after the edit and realized that no…
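For the "N most recent rows per group" part, a sketch with a window function, assuming columns named `Grouping`, `Bucket` and `Date` as in the example:

```python
from pyspark.sql import functions as F, Window

# Rank rows within each (Grouping, Bucket) by descending date.
w = Window.partitionBy("Grouping", "Bucket").orderBy(F.col("Date").desc())

# Keep the 2 most recent rows per group; use dense_rank() instead of
# row_number() if ties on the same date should all be kept.
result = (df.withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") <= 2)
            .drop("rn"))
```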
Most efficient way of transforming a date column to a timestamp column + an hour
I want to know if there is a better way of transforming a date column into a datetime column + 1 hour than the method I am currently using. Here is my dataframe: My code: Which gives the output: Does anyone know a more efficient way of doing this? Casting to a timestamp twice seems a bit clumsy. Many thanks.
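One common way to avoid casting twice is a single cast plus an interval expression; a sketch assuming the column is called `date_col` (the question's real column name is not shown):

```python
from pyspark.sql import functions as F

# Cast the date to a timestamp once, then shift it by one hour.
df2 = df.withColumn(
    "ts_plus_1h",
    F.col("date_col").cast("timestamp") + F.expr("INTERVAL 1 HOUR"),
)
```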
Read avro files in pyspark with PyCharm
I’m quite new to Spark. I’ve imported the pyspark library into a PyCharm venv and wrote the code below. Everything seems to be okay, but when I want to read an avro file I get the message: pyspark.sql.utils.AnalysisException: ‘Failed to find data source: avro. Avro is built-in but external data source module …
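The message means the spark-avro module is not on the classpath. A sketch of one fix, pulling the package in when the session is created; the Scala and Spark versions in the coordinate must match the installed pyspark, and the ones below are only an example:

```python
from pyspark.sql import SparkSession

# spark-avro is an external module, so request it via spark.jars.packages
# before the session is created (or pass --packages to spark-submit).
spark = (SparkSession.builder
         .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.1.2")
         .getOrCreate())

df = spark.read.format("avro").load("path/to/file.avro")
```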