Tag: apache-spark

collect_list by preserving order based on another variable

I am trying to create a new column of lists in PySpark using a groupby aggregation on an existing set of columns. An example input data frame is provided below: The expected output is: The values within a list are sorted by the date. I tried using collect_list as follows: But collect_list doesn’t guarantee order even if I sort the input

How to start a standalone cluster using pyspark?

I am using PySpark under Ubuntu with Python 2.7. I installed it using: While trying to follow the instructions to set up a Spark cluster, I can’t find the script start-master.sh. I assume this is because I installed PySpark and not regular Spark. I found here that I can connect a worker node to the master

Rename nested field in spark dataframe

Given a dataframe df in Spark: How can I rename the field array_field.a to array_field.a_renamed? [Update]: .withColumnRenamed() does not work with nested fields, so I tried this hacky and unsafe method: I know that setting a private attribute is not good practice, but I don’t know another way to set the schema for df. I think I am on the right

pyspark, Compare two rows in dataframe

I’m attempting to compare one row in a dataframe with the next to see the difference in timestamp. Currently the data looks like: I’ve tried mapping a function onto the dataframe to allow for comparing like this: (note: I’m trying to get rows with a difference greater than 4 hours) But I’m getting the following error: Which I believe is
