Tag: pyspark

How to start a standalone cluster using pyspark?

I am using pyspark on Ubuntu with Python 2.7. I installed it using … and I am trying to follow the instructions to set up a Spark cluster, but I can’t find the script start-master.sh. I assume this is because I installed pyspark and not regular Spark. I found here that I can connect a worker node to the master …

Rename nested field in spark dataframe

Given a dataframe df in Spark: … How can I rename the field array_field.a to array_field.a_renamed? [Update]: .withColumnRenamed() does not work with nested fields, so I tried this hacky and unsafe method: … I know that setting a private attribute is not good practice, but I don’t know another way to set the schema for df. I think I am on the right …

pyspark, Compare two rows in dataframe

I’m attempting to compare one row in a dataframe with the next to see the difference in timestamp. Currently the data looks like: … I’ve tried mapping a function onto the dataframe to allow for comparing like this: … (Note: I’m trying to get rows with a difference greater than 4 hours.) But I’m getting the following error: … which I believe is …
