I have the two PySpark dataframes df1 and df2 below. I want to multiply each row of df1, column by column, by the matching columns of df2's row. The final output should look like: Answer You can do a cross join and multiply the columns using a list comprehension:
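A minimal sketch of that answer, assuming df2 is a single-row dataframe sharing df1's column names (both frames are hypothetical here):

```python
from pyspark.sql import functions as F

# Alias df2's columns so the cross join does not create duplicate names.
joined = df1.crossJoin(
    df2.select([F.col(c).alias(f"{c}_2") for c in df2.columns])
)

# Multiply each df1 column by the matching df2 column, per the answer's
# list-comprehension approach.
result = joined.select(
    [(F.col(c) * F.col(f"{c}_2")).alias(c) for c in df1.columns]
)
result.show()
```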
Convert date, month, year, time to date format in PySpark
I have a file with a timestamp column. When I try to read the file with a schema I designed myself, the datetime column is populated with null. The source file has data as below, and I am using the code snippet below on it. DF.display() shows null for all the inputs, however my expected output
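Since the question body is truncated, here is a minimal sketch of the usual fix: read the column as a string and parse it explicitly with to_timestamp. The file path, column name, and format string are assumptions; adjust them to the actual source data.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType

# Read the timestamp column as a plain string so nothing is nulled out
# by a mismatched TimestampType in the schema.
schema = StructType([StructField("event_time", StringType(), True)])
df = spark.read.csv("path/to/source.csv", header=True, schema=schema)

# Parse explicitly; the pattern must match the raw text exactly.
df = df.withColumn("event_time", F.to_timestamp("event_time", "dd-MM-yyyy HH:mm:ss"))
df.show()
```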
Parse JSON string from PySpark DataFrame
I have a nested JSON dict that I need to convert to a Spark dataframe. This JSON dict is present in a dataframe column. I have been trying to parse the dict in that column using “from_json” and “get_json_object”, but have been unable to read the data. Here’s the smallest snippet of the source data that I’ve been trying to
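A minimal sketch of the from_json route, assuming a hypothetical json_col column and field names; replace the schema with the real structure of the dict:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType

# Schema mirroring the nested dict; from_json returns null when the
# schema does not match the string, which is a common pitfall here.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("meta", StructType([
        StructField("source", StringType(), True),
    ]), True),
])

parsed = df.withColumn("parsed", F.from_json(F.col("json_col"), schema))
flat = parsed.select("parsed.id", F.col("parsed.meta.source").alias("source"))
```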
Split a list of overlapping intervals into non-overlapping subintervals in a PySpark dataframe
I have a PySpark dataframe with the columns start_time and end_time, which define an interval per row. There is also a rate column, and I want to check whether a sub-interval (overlapped by definition) has different rate values; if it does, I want to keep the last record as the ground truth. Inputs: Answer
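A minimal sketch of one way to do this, assuming numeric boundaries and that "last record" means the row with the latest start_time; the data and names are hypothetical:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: overlapping intervals, one rate per row.
df = spark.createDataFrame(
    [(0, 10, 1.0), (5, 15, 2.0), (12, 20, 3.0)],
    ["start_time", "end_time", "rate"],
)

# Collect every boundary point and pair consecutive ones into
# non-overlapping sub-intervals.
bounds = (
    df.select(F.col("start_time").alias("t"))
      .union(df.select(F.col("end_time").alias("t")))
      .distinct()
)
subs = (
    bounds.withColumn("next_t", F.lead("t").over(Window.orderBy("t")))
          .where(F.col("next_t").isNotNull())
          .withColumnRenamed("t", "sub_start")
          .withColumnRenamed("next_t", "sub_end")
)

# Attach each sub-interval to the intervals covering it, keeping the
# "last" record (here: latest start_time) as the ground truth.
w = Window.partitionBy("sub_start", "sub_end").orderBy(F.desc("start_time"))
result = (
    subs.join(df, (subs.sub_start >= df.start_time) & (subs.sub_end <= df.end_time))
        .withColumn("rn", F.row_number().over(w))
        .where(F.col("rn") == 1)
        .select("sub_start", "sub_end", "rate")
)
result.show()
```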
Test whether the rows of one dataframe are in another
I have a PySpark dataframe df: and another, smaller PySpark dataframe with 3 rows of the same values, df2: Is there a way in PySpark to create a third, boolean dataframe indicating whether the rows in df2 are in df? Such as: Many thanks in advance. Answer You can do a left join and assign False if all columns joined
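A minimal sketch of that left-join approach, joining on all of df2's columns and flagging matches (the frames and column names are hypothetical):

```python
from pyspark.sql import functions as F

# Tag every distinct row of df2, left-join it onto df, and coalesce the
# tag: rows of df with no match come back null, i.e. False.
flagged = df.join(
    df2.dropDuplicates().withColumn("_hit", F.lit(True)),
    on=df2.columns,
    how="left",
)
result = (
    flagged.withColumn("in_df2", F.coalesce(F.col("_hit"), F.lit(False)))
           .drop("_hit")
)
result.show()
```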
How to obtain row percentages of a crosstab from a Spark dataframe using Python?
I used Python code: to create a crosstab from a Spark dataframe as follows: However, I cannot find code to obtain the row percentages. For example, the age-18 row percentages should be 5/12 = 41.7% for ‘no’ and 7/12 = 58.3% for ‘yes’, so the two percentages sum to 100%. Can someone advise me in this case? Many thanks
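A minimal sketch, assuming a crosstab of age vs. smoker with value columns ‘no’ and ‘yes’ (crosstab names its first column "age_smoker"):

```python
from pyspark.sql import functions as F

ct = df.crosstab("age", "smoker")  # first column comes back as "age_smoker"
value_cols = [c for c in ct.columns if c != "age_smoker"]

# Divide each cell by its row total to get row percentages.
row_total = sum(F.col(c) for c in value_cols)
pct = ct.select(
    "age_smoker",
    *[F.round(F.col(c) / row_total * 100, 1).alias(f"{c}_pct") for c in value_cols],
)
pct.show()
```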
Create a new column by replacing a comma-separated column’s values with a lookup based on another dataframe
I have a PySpark dataframe (source_df) in which there is a column whose values are comma-separated. I am trying to replace those values with a lookup based on another dataframe (lookup_df). source_df lookup_df output dataframe: Column A is a primary key and is always unique. Column T is unique for a given value of A. Answer You can split and
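A minimal sketch of that idea: split the comma-separated column, explode it, join against the lookup, and re-assemble. The column names (A, vals, key, T) are hypothetical:

```python
from pyspark.sql import functions as F

# Explode the comma-separated values into one row each.
exploded = source_df.withColumn("val", F.explode(F.split("vals", ",")))

# Look up each value's replacement, then stitch the list back together.
# Note: collect_list does not guarantee order; use posexplode and sort
# by position if the original order must be preserved.
joined = exploded.join(lookup_df, exploded["val"] == lookup_df["key"], "left")
result = joined.groupBy("A").agg(
    F.concat_ws(",", F.collect_list("T")).alias("vals_replaced")
)
```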
How to transpose a dataframe in PySpark?
How do I transpose columns in PySpark? I want columns to become rows, and rows to become columns. Here is the input: Expected outcome: Answer You can use the stack function to unpivot the vin, mean and cur columns, then pivot the idx column. Apply the transformations one by one to see how it works and what each part does.
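A minimal sketch of that stack-then-pivot pattern, assuming columns idx, vin, mean and cur; the values are cast to string so stack's branches share one type:

```python
from pyspark.sql import functions as F

# Unpivot vin/mean/cur into (col_name, value) pairs, keeping idx.
unpivoted = df.select(
    "idx",
    F.expr(
        "stack(3, 'vin', string(vin), 'mean', string(mean), 'cur', string(cur)) "
        "as (col_name, value)"
    ),
)

# Pivot idx back out: former rows become columns, former columns rows.
transposed = unpivoted.groupBy("col_name").pivot("idx").agg(F.first("value"))
transposed.show()
```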
Spark: How to parse a JSON string of nested lists into a Spark data frame?
How do I parse a JSON string of nested lists into a Spark data frame in PySpark? Input data frame: Expected output: Example code: There are a few related questions, but I cannot figure out how to do it from them: How to parse and transform json string from spark data frame rows in pyspark; How to transform JSON string with multiple keys, from spark
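A minimal sketch with from_json and a nested ArrayType schema, assuming a hypothetical payload column holding something like "[[1.0, 2.0], [3.0, 4.0]]":

```python
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType

# Parse the string into array<array<double>>, then flatten one level.
schema = ArrayType(ArrayType(DoubleType()))
parsed = df.withColumn("parsed", F.from_json("payload", schema))
rows = parsed.select(F.explode("parsed").alias("inner_list"))
rows.show()
```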
How to apply a condition in PySpark to keep null only if it is the sole value, else remove nulls
Condition: if an ID has a Score of ‘High’ or ‘Mid’ -> remove None; if an ID only has Score None -> keep None.

Input:

ID   Score
AAA  High
AAA  Mid
AAA  None
BBB  None

Desired output:

ID   Score
AAA  High
AAA  Mid
BBB  None

I’m having difficulty writing the if condition in PySpark. Is there any other way to tackle
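A minimal sketch using a window over ID: count the non-null scores per ID and keep a null row only when that count is zero:

```python
from pyspark.sql import Window, functions as F

w = Window.partitionBy("ID")

# F.count("Score") counts only non-null values, so non_null_cnt == 0
# means the ID has nothing but nulls and its null row should survive.
result = (
    df.withColumn("non_null_cnt", F.count("Score").over(w))
      .where(F.col("Score").isNotNull() | (F.col("non_null_cnt") == 0))
      .drop("non_null_cnt")
)
result.show()
```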