I have a PySpark dataframe, df: and another, smaller PySpark dataframe, df2, with 3 rows containing the same values: Is there a way in PySpark to create a third, boolean dataframe indicating which rows of df2 are in df? Such as: Many thanks in advance. Answer You can do a left join and assign False if all columns joined
Tag: pyspark
How to obtain row percentages of crosstab from a spark dataframe using python?
I used Python code: to create a crosstab from a Spark dataframe as follows: However, I cannot find code to obtain the row percentages. For example, the age-18 row percentages should be 5/12 = 41.7% for ‘no’ and 7/12 = 58.3% for ‘yes’; the two percentages sum to 100%. Could someone advise me on this? Many thanks
Create a new column by replacing comma-separated column’s values with a lookup based on another dataframe
I have a PySpark dataframe (source_df) in which one column holds comma-separated values. I am trying to replace those values with a lookup based on another dataframe (lookup_df). source_df lookup_df output dataframe: Column A is a primary key and is always unique. Column T is unique for a given value of A. Answer You can split and
How to transpose a dataframe in pyspark?
How do I transpose columns in PySpark? I want columns to become rows, and rows to become columns. Here is the input: Expected outcome: Answer You can use the stack function to unpivot the vin, mean and cur columns, then pivot the idx column: Apply the transformations one by one to see how it works and what each part does.
Spark: How to parse JSON string of nested lists to spark data frame?
How do I parse a JSON string of nested lists into a Spark data frame in PySpark? Input data frame: Expected output: Example code: There are a few examples, but I cannot figure out how to do it: How to parse and transform json string from spark data frame rows in pyspark How to transform JSON string with multiple keys, from spark
How to apply condition in PySpark to keep null only if one else remove nulls
Condition: If an ID has a Score ‘High’ or ‘Mid’ -> remove None. If an ID only has Score None -> just keep None.

Input:
ID   Score
AAA  High
AAA  Mid
AAA  None
BBB  None

Desired output:
ID   Score
AAA  High
AAA  Mid
BBB  None

I’m having difficulty writing the if condition in PySpark. Is there any other way to tackle
Spark: How to parse and transform json string from spark data frame rows
How do I parse and transform a JSON string from Spark dataframe rows in PySpark? I’m looking for help with how to: parse a JSON string to a JSON struct (output 1), and transform a JSON string to columns a, b and id (output 2). Background: via an API I receive JSON strings with a large number of rows (jstr1, jstr2, …), which are saved to a Spark df.
Spark: How to transform to Data Frame data from multiple nested XML files with attributes
How do I transform the values below from multiple XML files into a Spark data frame: attribute Id0 from Level_0, and Date/Value from Level_4. Required output: file_1.xml: file_2.xml: Current code example: Current output (the Id0 column with attributes is missing): There are some examples, but none of them solves the problem: -I’m using databricks spark_xml – https://github.com/databricks/spark-xml -There is an example, but not with attribute reading,
PySpark “illegal reflective access operation” when executed in terminal
I’ve installed Spark and its components locally and I’m able to execute PySpark code in Jupyter, IPython and via spark-submit; however, I receive the following warnings: The .py file executes, but should I be worried about these warnings? I don’t want to start writing code only to find later that it doesn’t execute down the line. FYI, PySpark is installed locally. Here’s the
How to read a gzip compressed json lines file into PySpark dataframe?
I have a JSON-lines file that I wish to read into a PySpark data frame. The file is gzip-compressed; the filename looks like this: file.jl.gz. I know how to read this file into a pandas data frame: I’m new to PySpark, and I’d like to learn the PySpark equivalent of this. Is there a way to read this file