How to parse and transform json string from spark dataframe rows in pyspark?
I’m looking for help with how to: parse a JSON string into a JSON struct (output 1), and transform a JSON string into columns a, b and id (output 2). Background: via an API I get JSON strings with a large number of rows (jstr1, jstr2, …), which are saved to a Spark df.
Tag: apache-spark-sql
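A minimal sketch of one way to do both, assuming the strings sit in a single column (called json_str here) and carry fields id, a and b as the desired output suggests; the sample strings and the schema are stand-ins for the real API payload:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for the API strings jstr1, jstr2, ...
jstr1 = '{"id": 1, "a": "x1", "b": "y1"}'
jstr2 = '{"id": 2, "a": "x2", "b": "y2"}'
df = spark.createDataFrame([(jstr1,), (jstr2,)], ["json_str"])

# Output 1: parse the JSON string into a struct column.
schema = StructType([
    StructField("id", IntegerType()),
    StructField("a", StringType()),
    StructField("b", StringType()),
])
parsed = df.withColumn("parsed", F.from_json("json_str", schema))

# Output 2: flatten the struct into top-level columns a, b and id.
flat = parsed.select("parsed.a", "parsed.b", "parsed.id")
flat.show()
```

If the schema is not known in advance, schema_of_json (Spark 2.4+) can derive it from a sample string.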
How to read a gzip compressed json lines file into PySpark dataframe?
I have a JSON-lines file that I wish to read into a PySpark data frame. The file is gzip compressed; the filename looks like this: file.jl.gz. I know how to read this file into a pandas data frame. I’m new to PySpark, and I’d like to learn the PySpark equivalent of this. Is there a way to read this file
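A short sketch of the usual approach: Spark's JSON reader already expects JSON Lines and decompresses gzip transparently based on the .gz extension, so the file can be passed to spark.read.json directly (the path below is just the filename from the question):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# spark.read.json expects JSON Lines (one object per line) by default,
# and gzip decompression happens automatically from the file extension.
df = spark.read.json("file.jl.gz")

df.printSchema()
df.show(5)
```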
How to get the N most recent dates in Pyspark
Is there a way to get the most recent 30 days’ worth of records for each grouping of data in PySpark? In this example, get the 2 records with the most recent dates within the groupings of (Grouping, Bucket), so a table like this would turn into this: Edit: I reviewed my question after the edit and realized that not doing
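One common pattern for this, sketched with hypothetical data and with column names (Grouping, Bucket, Date) taken from the question's wording: rank rows per group with row_number over a descending date window and keep the top N.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data; column names follow the question's wording.
df = spark.createDataFrame(
    [("A", "b1", "2021-03-01"), ("A", "b1", "2021-03-05"),
     ("A", "b1", "2021-02-10"), ("B", "b2", "2021-03-04")],
    ["Grouping", "Bucket", "Date"],
).withColumn("Date", F.to_date("Date"))

# Rank rows within each (Grouping, Bucket) by date, newest first,
# then keep the N most recent (N = 2 in the question's example).
w = Window.partitionBy("Grouping", "Bucket").orderBy(F.col("Date").desc())
recent = (df.withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") <= 2)
            .drop("rn"))
recent.show()
```

For "the most recent 30 days" rather than "the N most recent rows", the filter could instead compare Date against the per-group maximum date (F.max over a partition-only window) minus 30 days.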
Pyspark: How to code a complicated dataframe algorithm problem (summing with condition)
I have a dataframe that looks like this: date: sorted nicely; Trigger: only T or F; value: any random decimal (float) value; col1: represents a number of days and cannot be lower than -1 (-1 <= col1 < infinity); col2: represents a number of days and cannot be negative (col2 >= 0). Calculation logic: If col1 ==
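The excerpt cuts off before the full calculation logic, so only the general shape can be sketched: a conditional running sum built from F.when inside an aggregate over a date-ordered window. The sample data and the condition below are illustrative assumptions, not the question's actual rule.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Made-up rows matching the described columns; the real rule is cut off
# in the excerpt, so the condition below is only a placeholder.
df = spark.createDataFrame(
    [("2021-03-01", "T", 10.5, 3, 0),
     ("2021-03-02", "F", 20.0, -1, 2),
     ("2021-03-03", "T", 30.25, 1, 1)],
    ["date", "Trigger", "value", "col1", "col2"],
).withColumn("date", F.to_date("date"))

# General pattern: a running sum of `value` that only counts rows
# satisfying a condition (here: Trigger == 'T' and col1 >= 0).
w = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df = df.withColumn(
    "cond_sum",
    F.sum(F.when((F.col("Trigger") == "T") & (F.col("col1") >= 0),
                 F.col("value")).otherwise(0.0)).over(w),
)
df.show()
```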
Removing non-ASCII and special characters in a PySpark dataframe column
I am reading data from CSV files which have about 50 columns; a few of the columns (4 to 5) contain text data with non-ASCII characters and special characters. I am trying to remove all the non-ASCII and special characters and keep only English characters, and I tried to do it as below. There are no spaces in my column names. I
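A minimal sketch of the usual approach with regexp_replace, assuming the 4-5 text columns are listed by name (the column name and sample data below are made up); the character class decides what counts as "English characters":

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up sample with accented and special characters.
df = spark.createDataFrame([("café & résumé!",), ("plain text 123",)], ["col_a"])

# Keep only English letters, digits and spaces; widen the character class
# if some punctuation should survive.
text_cols = ["col_a"]  # stand-in for the 4-5 text columns in the question
for c in text_cols:
    df = df.withColumn(c, F.regexp_replace(F.col(c), r"[^a-zA-Z0-9 ]", ""))

df.show(truncate=False)
```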
I need to append only those columns which have non-null values in a PySpark dataframe
I have a PySpark dataframe (df) with the below sample table (table1): id, col1, col2, col3; 1, abc, null, def; 2, null, def, abc; 3, def, abc, null. I am trying to get a new column (final) by appending all the columns while ignoring null values. I have tried PySpark code and used f.array(col1, col2, col3). Values are getting appended
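A small sketch using concat_ws, which skips nulls on its own, so no explicit null handling is needed; the sample rows mirror table1 from the question and the output column is called final as described:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Rows mirroring table1 from the question.
df = spark.createDataFrame(
    [(1, "abc", None, "def"), (2, None, "def", "abc"), (3, "def", "abc", None)],
    ["id", "col1", "col2", "col3"],
)

# concat_ws drops nulls automatically, unlike f.array / concat,
# so "abc,def" comes out for id 1 with no extra filtering.
df = df.withColumn("final", F.concat_ws(",", "col1", "col2", "col3"))
df.show()
```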
Extract multiple words using regexp_extract in PySpark
I have a list which contains some words, and I need to extract matching words from a text line. I found this, but it only extracts one word. keys file content: this is a keyword. part_description file content: 32015 this is a keyword hello world. Code, outputs and expected output: I want to return all matching keywords and their count and
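A sketch of one way to get every matching keyword and its count, assuming the keys are single words (the sample contents below are stand-ins for the keys and part_description files): since regexp_extract stops at the first match, split the description into words, explode, and join against the keyword list.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-ins for the keys and part_description file contents.
keys = spark.createDataFrame([("keyword",), ("hello",)], ["key"])
parts = spark.createDataFrame(
    [("32015 this is a keyword hello world keyword",)], ["part_description"]
)

# regexp_extract returns only the first match, so instead split the
# description into words, explode, and join against the keyword list.
words = parts.withColumn("word", F.explode(F.split("part_description", r"\s+")))
matches = words.join(keys, words.word == keys.key).groupBy("key").count()
matches.show()
```

On Spark 3.1+ regexp_extract_all is another option; multi-word (phrase) keys would need a contains-style join instead of the word-level equality used here.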
Median and quantile values in Pyspark
In my dataframe I have an age column. The total number of rows is approximately 77 billion. I want to calculate the quantile values of that column using PySpark. I have some code, but the computation time is huge (maybe my process is very bad). Is there any good way to improve this? Dataframe example: What I have done so
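With roughly 77 billion rows, an approximate single-pass algorithm is the practical route; a minimal sketch with made-up ages using approxQuantile, where the last argument is the relative error:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up ages; the real dataframe is ~77 billion rows.
df = spark.createDataFrame([(23,), (35,), (41,), (52,), (29,)], ["age"])

# approxQuantile works in a single pass; the last argument is the relative
# error (0.0 is exact but far more expensive, 0.01 is usually plenty).
q25, median, q75 = df.approxQuantile("age", [0.25, 0.5, 0.75], 0.01)
print(q25, median, q75)
```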
PySpark: filtering with isin returns empty dataframe
Context: I need to filter a dataframe based on the contents of another dataframe’s column, using the isin function. For Python users working with pandas, that would be isin(); for R users, that would be %in%. So I have a simple Spark dataframe with id and value columns: I want to get all ids that appear multiple times. Here’s a dataframe
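A sketch of the two usual fixes, using hypothetical sample data shaped like the question's id/value dataframe: isin needs plain Python values (not a dataframe or column), so either collect the ids first or replace isin with a left-semi join, which also scales better for large id sets.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical id/value data shaped like the question's dataframe.
df = spark.createDataFrame(
    [(1, "a"), (1, "b"), (2, "c"), (3, "d"), (3, "e")], ["id", "value"]
)

# ids that appear more than once
dup_ids = df.groupBy("id").count().filter(F.col("count") > 1).select("id")

# isin needs plain Python values, not a dataframe or column, so collect first...
id_list = [row["id"] for row in dup_ids.collect()]
filtered = df.filter(F.col("id").isin(id_list))

# ...or, better for large id sets, skip isin and use a left-semi join.
filtered_join = df.join(dup_ids, on="id", how="left_semi")
filtered.show()
```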
What are alternative methods for pandas quantile and cut in PySpark 1.6?
I’m a newbie to PySpark. I have pandas code like below. I found approxQuantile in PySpark 2.x, but I didn’t find any such method in PySpark 1.6.0. My sample input: df.show(), df.collect(). I have to loop the above logic over all input columns. Could anyone please suggest how to rewrite the above code for a PySpark 1.6 dataframe? Thanks in advance.
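A sketch for 1.6, under the assumption that a HiveContext is available so Hive's percentile_approx UDAF can stand in for pandas quantile (DataFrame.approxQuantile only arrived in Spark 2.0), with Bucketizer from pyspark.ml.feature playing the role of pandas cut; the column name and sample values are made up:

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext
import pyspark.sql.functions as F
from pyspark.ml.feature import Bucketizer

sc = SparkContext()
sqlContext = HiveContext(sc)  # gives access to Hive UDAFs such as percentile_approx

# Made-up numeric column standing in for the question's input.
df = sqlContext.createDataFrame([(1.0,), (2.0,), (3.0,), (4.0,), (5.0,)], ["col_a"])

# pandas quantile stand-in: Hive's percentile_approx, called through expr.
quantiles = df.select(
    F.expr("percentile_approx(col_a, array(0.25, 0.5, 0.75))").alias("q")
).collect()[0]["q"]

# pandas cut stand-in: Bucketizer maps each value to a bin index
# given explicit split points.
splits = [float("-inf")] + list(quantiles) + [float("inf")]
binned = Bucketizer(splits=splits, inputCol="col_a", outputCol="col_a_bin").transform(df)
binned.show()
```

Looping this over all input columns is then a matter of repeating the percentile_approx expression and the Bucketizer per column name.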