I have a Spark dataframe sdf with GPS points that looks like this: Since the Spark dataframe contains different GPS trajectories generated by different users on different days, I want to write a function that loops through this dataframe and feeds the corresponding set of coordinates to the (OSRM) request per date and per user group, not all at once.
Tag: apache-spark-sql
How to do row-wise multiplication of two PySpark dataframes
I have the two PySpark dataframes df1 and df2 below: I want to multiply each row of df1 by the matching columns of the df2 row. The final output should look like this: Answer You can do a cross join and multiply the columns using a list comprehension:
Parse JSON string from Pyspark Dataframe
I have a nested JSON dict that I need to convert to a Spark dataframe. This JSON dict is present in a dataframe column. I have been trying to parse the dict in the dataframe column using “from_json” and “get_json_object”, but have been unable to read the data. Here’s the smallest snippet of the source data that I’ve been trying to
Split a list of overlapping intervals into non-overlapping subintervals in a PySpark dataframe
I have a PySpark dataframe with columns start_time and end_time that define an interval per row. There is also a rate column, and I want to know whether a sub-interval (which is overlapped by definition) has different rate values; if it does, I want to keep the last record as the ground truth. Inputs: Answer
Test whether rows of one dataframe are in another
I have a PySpark dataframe df: and another smaller PySpark dataframe df2 with 3 rows containing the same values: Is there a way in PySpark to create a third boolean dataframe indicating which rows of df2 are in df? Such as: Many thanks in advance. Answer You can do a left join and assign False if all columns joined
How to obtain row percentages of crosstab from a spark dataframe using python?
I used python code: to create a crosstab from a spark dataframe as follows: However, I cannot find code to obtain the row percentages. For example, the age-18 row percentages should be 5/12 = 41.7% for ‘no’ and 7/12 = 58.3% for ‘yes’; the two percentages sum to 100%. Could someone advise me on this? Many thanks
Create a new column by replacing comma-separated column’s values with a lookup based on another dataframe
I have a PySpark dataframe (source_df) with a column whose values are comma-separated. I am trying to replace those values with a lookup based on another dataframe (lookup_df). source_df lookup_df output dataframe: Column A is a primary key and is always unique. Column T is unique for a given value of A. Answer You can split and
How to transpose a dataframe in pyspark?
How do I transpose columns in PySpark? I want columns to become rows and rows to become columns. Here is the input: Expected outcome: Answer You can use the stack function to unpivot the vin, mean and cur columns, then pivot the idx column. Apply the transformations one by one to see how it works and what each part does.
Spark: How to parse JSON string of nested lists to spark data frame?
How do I parse a JSON string of nested lists to a Spark data frame in PySpark? Input data frame: Expected output: Example code: There are a few examples, but I cannot figure out how to do it: How to parse and transform json string from spark data frame rows in pyspark How to transform JSON string with multiple keys, from spark
How to apply a condition in PySpark to keep null only if it is the only value, else remove nulls
Condition: if an ID has a Score of ‘High’ or ‘Mid’, remove the None rows; if an ID only has Score None, keep the None.

Input:
ID  Score
AAA High
AAA Mid
AAA None
BBB None

Desired output:
ID  Score
AAA High
AAA Mid
BBB None

I’m having difficulty writing the if condition in PySpark. Is there any other way to tackle