Tag: pyspark

Pyspark agg max function showing different result

I was just studying some PySpark code and didn't understand these particular lines. I have Python code such as below: When showing empDF after Isn’t it supposed to show the longest list? It is showing [Python, R] as the output. I don't understand how this output comes about. Answer Pyspark’s max f…
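One likely explanation for the result in the question above, offered as an assumption since the answer is cut off in this excerpt: when `max` is applied to list-like values, the comparison runs element by element (lexicographically) rather than by length. Plain Python lists order the same way, so the sketch below uses them, with made-up skill lists, to show why `['Python', 'R']` can win over a longer list.

```python
# Hypothetical skill lists; the question's actual data is not shown in the excerpt.
skills = [
    ["Java", "Scala", "C++"],
    ["Python", "R"],
    ["Java", "Spark"],
]

# Lists compare element by element, so "Python" > "Java" decides the winner
# before length is ever considered.
lexicographic_max = max(skills)

# To get the longest list instead, compare by length explicitly.
longest = max(skills, key=len)

print(lexicographic_max)  # ['Python', 'R']
print(longest)            # ['Java', 'Scala', 'C++']
```

If the longest list is what is wanted in PySpark itself, one option would be to aggregate on the array's length (for example via `pyspark.sql.functions.size`) rather than on the array column directly.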

Parse multiple line CSV using PySpark, Python or Shell

Input (2 columns): Note: Harry and Prof. do not have starting quotes. Output (2 columns) What I tried (PySpark) Issue: The above code worked fine where the multiline value had both start and end double quotes (e.g. the row starting with Ronald), but it didn't work with rows where we only have end quotes but no start qu…

PySpark – Selecting all rows within each group

I have a dataframe similar to the one below. From the above dataframe, I would like to keep all rows up to the most recent sale relative to the date. So essentially, I will only have a unique date for each row. In the case of the above example, the output would look like: Can you please guide me on how I can get to this result?

SAS Proc Transpose to Pyspark

I am trying to convert a SAS proc transpose statement to PySpark in Databricks. With the following data as a sample: I would expect the result to look like this: I tried using the pandas pivot_table() function with the following code; however, I ran into some performance issues with the size of the data: Is ther…