Tag: apache-spark

How to parse and transform json string from spark dataframe rows in pyspark?
I’m looking for help with how to parse a JSON string into a JSON struct (output 1) and transform a JSON string into columns a, b and id (output 2). Background: I get JSON strings via an API with a large number of rows (jstr1, jstr2, …), which are …
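The excerpt is truncated, but a minimal sketch of the usual `from_json` approach, assuming a single string column `jstr` holding JSON objects with keys `id`, `a` and `b` (the column name and types are assumptions, not taken from the question):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: one JSON string per row (column name "jstr" is assumed).
df = spark.createDataFrame([('{"id": 1, "a": "x", "b": "y"}',)], ["jstr"])

schema = StructType([
    StructField("id", LongType()),
    StructField("a", StringType()),
    StructField("b", StringType()),
])

# Output 1: parse the JSON string into a struct column.
parsed = df.withColumn("json", F.from_json("jstr", schema))

# Output 2: flatten the struct into top-level columns id, a and b.
flat = parsed.select("json.id", "json.a", "json.b")
flat.show()
```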
Spark: How to transform data from multiple nested XML files with attributes to a Data Frame
How to transform the values below from multiple XML files to a Spark data frame: attribute Id0 from Level_0, Date/Value from Level_4. Required output: file_1.xml: file_2.xml: Current Code Example: Current Output: (Id0 column with attributes missing) There are some examples, but none of them solve the problem: -I’…
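The code in the question is truncated; a sketch of one way to read nested XML with the spark-xml package, where the row tag, the nesting path between Level_0 and Level_4, and the file paths are all assumptions based only on the names mentioned above:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# spark-xml must be on the classpath, e.g.
#   spark-submit --packages com.databricks:spark-xml_2.12:0.14.0 ...
spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("xml")
      .option("rowTag", "Level_0")   # assumed: each Level_0 element becomes one row
      .load("file_*.xml"))

# XML attributes are exposed with a leading underscore by default
# (attributePrefix "_"), so the Id0 attribute of Level_0 appears as "_Id0".
out = df.select(
    F.col("_Id0").alias("Id0"),
    # the intermediate levels are hypothetical; adjust to the real nesting
    F.col("Level_1.Level_2.Level_3.Level_4.Date").alias("Date"),
    F.col("Level_1.Level_2.Level_3.Level_4.Value").alias("Value"),
)
```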
PySpark “illegal reflective access operation” when executed in terminal
I’ve installed Spark and its components locally and I’m able to execute PySpark code in Jupyter, IPython and via spark-submit; however, I receive the following WARNINGs: The .py file executes, but should I be worried about these warnings? I don’t want to start writing some code to late…
How to read a gzip compressed json lines file into PySpark dataframe?
I have a JSON-lines file that I wish to read into a PySpark data frame. The file is gzip compressed. The filename looks like this: file.jl.gz I know how to read this file into a pandas data frame: I’m new to pyspark, and I’d like to learn the pyspark equivalent of this. Is there a way to read t…
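A minimal sketch of the PySpark equivalent: `spark.read.json` expects JSON-lines input by default, and Spark decompresses `.gz` files transparently based on the extension, so no extra options should be needed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reads the gzipped JSON-lines file directly; the schema is inferred.
df = spark.read.json("file.jl.gz")
df.printSchema()
```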
spark-nlp ‘JavaPackage’ object is not callable
I am using JupyterLab to run spark-nlp text analysis. At the moment I am just running the sample code: I get the following error: I read a few of the GitHub issues developers raised in the spark-nlp repo, but the fixes are not working for me. I am wondering if the use of pyenv is causing problems, but it works…
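This error usually means the spark-nlp jar is missing on the JVM side even though the Python package is installed (or the versions do not match). A sketch of one common fix, letting `sparknlp.start()` build the session with a matching jar; the pretrained pipeline name below is just an example, not the code from the question:

```python
import sparknlp

# sparknlp.start() creates a SparkSession with the matching spark-nlp jar
# pulled in via spark.jars.packages, which avoids the
# "'JavaPackage' object is not callable" error.
spark = sparknlp.start()

from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("Spark NLP is an open-source text processing library.")
```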
How to calculate cumulative sum over date range excluding weekends in PySpark 2.0?
This is an extension to an earlier question I raised here: How to calculate difference between dates excluding weekends in PySpark 2.2.0. My Spark dataframe looks like the one below and can be generated with the accompanying code: I am trying to calculate cumulative sums over periods of 2, 3, 4, 5 & 30 days. Below i…
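The dataframe and code in the question are truncated; a sketch of one approach, assuming columns named `key`, `date` and `value`: drop weekend rows, give each remaining business day a consecutive index, then run rolling sums over that index with `rangeBetween`:

```python
from pyspark.sql import functions as F, Window

# Sketch only: "key", "date" and "value" are assumed column names,
# not the ones from the question's dataframe.
business = df.filter(~F.date_format("date", "E").isin("Sat", "Sun"))  # drop weekends

# Give each remaining business day a consecutive index per key, so that a
# window of N index values spans exactly N business days.
business = business.withColumn(
    "bday", F.dense_rank().over(Window.partitionBy("key").orderBy("date"))
)

def rolling_sum(days):
    w = Window.partitionBy("key").orderBy("bday").rangeBetween(-(days - 1), 0)
    return F.sum("value").over(w)

result = business.select(
    "key", "date", "value",
    *[rolling_sum(d).alias("sum_{}d".format(d)) for d in (2, 3, 4, 5, 30)],
)
```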
Interpolation in PySpark throws java.lang.IllegalArgumentException
I don’t know how to interpolate in PySpark when the DataFrame contains many columns. Let me explain. I need to group by webID and interpolate counts values at a 1-minute interval. However, when I apply the code shown below, Error:
Answer: Set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1. https://spa…
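A sketch of one way to apply that fix, setting `ARROW_PRE_0_15_IPC_FORMAT=1` for both the driver and the executors so that pyarrow >= 0.15 writes the legacy IPC format Spark 2.x expects; putting the variable in `conf/spark-env.sh` is the other common option:

```python
import os
from pyspark.sql import SparkSession

# Driver side.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

# Executor side, via the standard spark.executorEnv.* config prefix.
spark = (SparkSession.builder
         .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
         .getOrCreate())
```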
How to get the N most recent dates in Pyspark
Is there a way to get the 30 most recent days’ worth of records for each grouping of data in PySpark? In this example, get the 2 records with the most recent dates within the groupings of (Grouping, Bucket). So a table like this would turn into this: Edit: I reviewed my question after the edit and realized that no…
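For the "N most recent rows per group" part, a sketch with a window function, assuming columns named `Grouping`, `Bucket` and `Date` as in the example:

```python
from pyspark.sql import functions as F, Window

# Rank rows within each (Grouping, Bucket) by descending date.
w = Window.partitionBy("Grouping", "Bucket").orderBy(F.col("Date").desc())

# Keep the 2 most recent rows per group; use dense_rank() instead of
# row_number() if ties on the same date should all be kept.
result = (df.withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") <= 2)
            .drop("rn"))
```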
Most efficient way of transforming a date column to a timestamp column + an hour
I want to know if there is a better way of transforming a date column into a datetime column + 1 hour than the method I am currently using. Here is my dataframe: My code: Which gives the output: Does anyone know a more efficient way of doing this? Casting to a timestamp twice seems a bit clumsy. Many thanks.
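One common way to avoid casting twice is a single cast plus an interval expression; a sketch assuming the column is called `date_col` (the question's real column name is not shown):

```python
from pyspark.sql import functions as F

# Cast the date to a timestamp once, then shift it by one hour.
df2 = df.withColumn(
    "ts_plus_1h",
    F.col("date_col").cast("timestamp") + F.expr("INTERVAL 1 HOUR"),
)
```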
Read avro files in pyspark with PyCharm
I’m quite new to Spark. I’ve imported the pyspark library into a PyCharm venv and wrote the code below. Everything seems to be okay, but when I want to read an avro file I get the message: pyspark.sql.utils.AnalysisException: ‘Failed to find data source: avro. Avro is built-in but external data source module …
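The message means the spark-avro module is not on the classpath. A sketch of one fix, pulling the package in when the session is created; the Scala and Spark versions in the coordinate must match the installed pyspark, and the ones below are only an example:

```python
from pyspark.sql import SparkSession

# spark-avro is an external module, so request it via spark.jars.packages
# before the session is created (or pass --packages to spark-submit).
spark = (SparkSession.builder
         .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.1.2")
         .getOrCreate())

df = spark.read.format("avro").load("path/to/file.avro")
```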