I am trying to create a new column of lists in PySpark using a groupBy aggregation on an existing set of columns. An example input data frame is provided below: The expected output is: The values within each list are sorted by the date. I tried using collect_list as follows: But collect_list doesn’t guarantee order even if I sort the input
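A common workaround (a sketch; the column names id, date and value are assumptions about the example schema) is to collect (date, value) structs and sort the resulting array, since structs sort by their first field:

```python
from pyspark.sql import functions as F

# Collect (date, value) pairs per id, sort them by date, then keep only the values.
result = (
    df.groupBy("id")
      .agg(F.sort_array(F.collect_list(F.struct("date", "value"))).alias("pairs"))
      .withColumn("values", F.col("pairs.value"))
      .drop("pairs")
)
```

Because sort_array orders the structs by date, the extracted values come out in date order regardless of the order in which collect_list gathered them.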
Tag: apache-spark
PySpark reversing StringIndexer in nested array
I’m using PySpark to do collaborative filtering with ALS. My original user and item IDs are strings, so I used StringIndexer to convert them to numeric indices (PySpark’s ALS model requires numeric IDs). After I’ve fitted the model, I can get the top 3 recommendations for each user like so: The recs dataframe looks like so: I want
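One way to map the indices back (a sketch; the column names userIdIndex, itemIdIndex and the fitted item_indexer model are assumptions) is to explode the nested recommendations and apply IndexToString with the labels stored on the StringIndexerModel:

```python
from pyspark.sql import functions as F
from pyspark.ml.feature import IndexToString

# Flatten the array<struct> of recommendations into one row per (user, item).
exploded = (
    recs.select("userIdIndex", F.explode("recommendations").alias("rec"))
        .select("userIdIndex",
                F.col("rec.itemIdIndex").cast("double").alias("itemIdIndex"),
                F.col("rec.rating").alias("rating"))
)

# Map item indices back to the original string IDs using the indexer's labels.
decoder = IndexToString(inputCol="itemIdIndex", outputCol="itemId",
                        labels=item_indexer.labels)
decoded = decoder.transform(exploded)
```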
How to start a standalone cluster using pyspark?
I am using PySpark under Ubuntu with Python 2.7. I installed it using: And I am trying to follow the instructions to set up a Spark cluster, but I can’t find the start-master.sh script. I assume this has to do with the fact that I installed pyspark and not regular Spark. I found here that I can connect a worker node to the master
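As a quick check (a sketch, not a definitive answer): you can look at where pip placed the PySpark package and whether it shipped the standalone-cluster scripts at all; the exact layout varies by PySpark version.

```python
import os
import pyspark

# Where pip installed PySpark, e.g. .../site-packages/pyspark
spark_home = os.path.dirname(pyspark.__file__)
print(spark_home)

# start-master.sh lives under sbin/ in a full Spark distribution;
# pip installs may or may not include this directory.
sbin = os.path.join(spark_home, "sbin")
print(os.listdir(sbin) if os.path.isdir(sbin) else "no sbin/ directory shipped")
```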
Selecting only numeric/string columns names from a Spark DF in pyspark
I have a Spark DataFrame in PySpark (2.1.0) and I am looking to get the names of numeric columns only or string columns only. For example, this is the schema of my DF: This is what I need: How can I do that? Answer: dtypes is a list of tuples (columnName, type), so you can use a simple filter
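A minimal sketch of that filter (assuming df is the DataFrame in question; extend the numeric type list as needed):

```python
# df.dtypes returns [(columnName, typeName), ...] as plain Python strings
numeric_prefixes = ("int", "bigint", "smallint", "tinyint", "float", "double", "decimal")
numeric_cols = [c for c, t in df.dtypes if t.startswith(numeric_prefixes)]
string_cols = [c for c, t in df.dtypes if t == "string"]

print(numeric_cols)
print(string_cols)
```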
PySpark: Get first Non-null value of each column in dataframe
I’m dealing with different Spark DataFrames, which have a lot of null values in many columns. I want to get any one non-null value from each column to see if that value can be converted to datetime. I tried doing df.na.drop().first() in the hope that it would drop all rows with any null value, and of the remaining DataFrame, I’ll
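A more direct approach (a sketch; df stands for the DataFrame in question) is to aggregate with first(..., ignorenulls=True), which yields one row holding the first non-null value seen in each column:

```python
from pyspark.sql import functions as F

# Note: without an explicit sort, "first" means first in whatever order
# Spark happens to scan the rows.
first_non_null = df.select(
    [F.first(c, ignorenulls=True).alias(c) for c in df.columns]
)
first_non_null.show()
```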
Rename nested field in spark dataframe
Having a dataframe df in Spark: How to rename the field array_field.a to array_field.a_renamed? [Update]: .withColumnRenamed() does not work with nested fields, so I tried this hacky and unsafe method: I know that setting a private attribute is not good practice, but I don’t know any other way to set the schema for df. I think I am on a right
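One alternative to patching private attributes (a sketch; the element types long and double are assumptions and must match the real schema) is to cast the array of structs to the same structure under a new field name:

```python
from pyspark.sql import functions as F

# Casting an array<struct<...>> to a struct type with the same field types
# but different field names effectively renames the nested fields.
df_renamed = df.withColumn(
    "array_field",
    F.col("array_field").cast("array<struct<a_renamed:long,b:double>>")
)
df_renamed.printSchema()
```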
How to run Spark code in Airflow?
Hello people of the Earth! I’m using Airflow to schedule and run Spark tasks. All I have found so far is Python DAGs that Airflow can manage. DAG example: The problem is that I’m not good at Python and have some tasks written in Java. My question is: how do I run a Spark Java jar in a Python DAG? Or maybe there
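The DAG itself still has to be Python, but it can simply hand the jar to spark-submit. A sketch using SparkSubmitOperator (the connection id, jar path, main class and DAG settings are placeholders; older Airflow versions import the operator from airflow.contrib instead of the provider package):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_java_job",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
) as dag:
    submit_jar = SparkSubmitOperator(
        task_id="submit_spark_jar",
        conn_id="spark_default",            # Spark connection configured in Airflow
        application="/path/to/your-job.jar",
        java_class="com.example.YourMainClass",
    )
```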
How to select last row and also how to access PySpark dataframe by index?
From a PySpark SQL dataframe like: How do I get the last row? (With df.limit(1) I can get the first row of the dataframe into a new dataframe.) And how can I access dataframe rows by index, like row no. 12 or 200? In pandas I can do: I am just curious how to access a PySpark dataframe in such ways or alternative ways.
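A sketch of both operations (some_col is a placeholder for whichever column defines the ordering, since a Spark DataFrame has no inherent row order):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# "Last" row: sort descending and take one row.
last_row = df.orderBy(F.col("some_col").desc()).limit(1)

# Index-style access: attach a row number, then filter on it.
# Note: a window without partitionBy moves everything to a single partition.
w = Window.orderBy("some_col")
indexed = df.withColumn("row_id", F.row_number().over(w))
row_12 = indexed.filter(F.col("row_id") == 12)
```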
Retrieve top n in each group of a DataFrame in pyspark
There’s a DataFrame in pyspark with data as below: What I expect is 2 records returned for each group with the same user_id, namely those with the highest score. Consequently, the result should look as follows: I’m really new to pyspark; could anyone give me a code snippet or a pointer to the related documentation for this problem? Great
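A standard way to do this (a sketch assuming the columns are named user_id and score, as described above) is a window ranked by score within each user_id:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("user_id").orderBy(F.col("score").desc())

top2_per_user = (
    df.withColumn("rank", F.row_number().over(w))
      .filter(F.col("rank") <= 2)        # keep the 2 highest-scoring rows per user
      .drop("rank")
)
```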
PySpark: compare two rows in a dataframe
I’m attempting to compare one row in a dataframe with the next to see the difference in timestamp. Currently the data looks like: I’ve tried mapping a function onto the dataframe to allow for comparing like this: (note: I’m trying to get rows with a difference greater than 4 hours) But I’m getting the following error: Which I believe is
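Rather than mapping over pairs of rows, a window with lag() compares each row to the previous one. A sketch (the column names user_id and ts are assumptions about the data shown):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Bring the previous row's timestamp onto the current row, then take the difference.
w = Window.partitionBy("user_id").orderBy("ts")

with_gap = df.withColumn(
    "diff_seconds",
    F.col("ts").cast("long") - F.lag("ts").over(w).cast("long")
)

# Rows more than 4 hours after the previous one (the first row per user has a null gap).
over_four_hours = with_gap.filter(F.col("diff_seconds") > 4 * 3600)
```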