I am trying to create a new column of lists in PySpark using a groupBy aggregation on an existing set of columns. An example input data frame is provided below: The expected output is: The values within each list are sorted by the date. I tried using collect_list as follows: But collect_list doesn’t guarantee order even if I sort the input
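A common workaround (a sketch; the column names id, date and value are assumptions about the example schema) is to collect (date, value) structs and sort the resulting array, since structs sort by their first field:

```python
from pyspark.sql import functions as F

# Collect (date, value) pairs per id, sort them by date, then keep only the values.
result = (
    df.groupBy("id")
      .agg(F.sort_array(F.collect_list(F.struct("date", "value"))).alias("pairs"))
      .withColumn("values", F.col("pairs.value"))
      .drop("pairs")
)
```

Because sort_array orders the structs by date, the extracted values come out in date order regardless of the order in which collect_list gathered them.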
Tag: apache-spark
PySpark reversing StringIndexer in nested array
I’m using PySpark to do collaborative filtering with ALS. My original user and item IDs are strings, so I used StringIndexer to convert them to numeric indices (PySpark’s ALS model requires numeric IDs). After I’ve fitted the model, I can get the top 3 recommendations for each user like so: The recs dataframe looks like so: I want
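One way to map the indices back (a sketch; the column names userIdIndex, itemIdIndex and the fitted item_indexer model are assumptions) is to explode the nested recommendations and apply IndexToString with the labels stored on the StringIndexerModel:

```python
from pyspark.sql import functions as F
from pyspark.ml.feature import IndexToString

# Flatten the array<struct> of recommendations into one row per (user, item).
exploded = (
    recs.select("userIdIndex", F.explode("recommendations").alias("rec"))
        .select("userIdIndex",
                F.col("rec.itemIdIndex").cast("double").alias("itemIdIndex"),
                F.col("rec.rating").alias("rating"))
)

# Map item indices back to the original string IDs using the indexer's labels.
decoder = IndexToString(inputCol="itemIdIndex", outputCol="itemId",
                        labels=item_indexer.labels)
decoded = decoder.transform(exploded)
```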
How to start a standalone cluster using pyspark?
I am using PySpark under Ubuntu with Python 2.7. I installed it using: And I am trying to follow the instructions to set up a Spark cluster, but I can’t find the start-master.sh script. I assume this has to do with the fact that I installed pyspark and not regular Spark. I found here that I can connect a worker node to the master
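As a quick check (a sketch, not a definitive answer): you can look at where pip placed the PySpark package and whether it shipped the standalone-cluster scripts at all; the exact layout varies by PySpark version.

```python
import os
import pyspark

# Where pip installed PySpark, e.g. .../site-packages/pyspark
spark_home = os.path.dirname(pyspark.__file__)
print(spark_home)

# start-master.sh lives under sbin/ in a full Spark distribution;
# pip installs may or may not include this directory.
sbin = os.path.join(spark_home, "sbin")
print(os.listdir(sbin) if os.path.isdir(sbin) else "no sbin/ directory shipped")
```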
Selecting only numeric/string columns names from a Spark DF in pyspark
I have a Spark DataFrame in PySpark (2.1.0) and I am looking to get the names of numeric columns only or string columns only. For example, this is the schema of my DF: This is what I need: How can I do that? Answer: dtypes is a list of tuples (columnName, type), so you can use a simple filter
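A minimal sketch of that filter (assuming df is the DataFrame in question; extend the numeric type list as needed):

```python
# df.dtypes returns [(columnName, typeName), ...] as plain Python strings
numeric_prefixes = ("int", "bigint", "smallint", "tinyint", "float", "double", "decimal")
numeric_cols = [c for c, t in df.dtypes if t.startswith(numeric_prefixes)]
string_cols = [c for c, t in df.dtypes if t == "string"]

print(numeric_cols)
print(string_cols)
```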
PySpark: Get first Non-null value of each column in dataframe
I’m dealing with different Spark DataFrames, which have a lot of null values in many columns. I want to get any one non-null value from each column to see if that value can be converted to datetime. I tried doing df.na.drop().first() in the hope that it would drop all rows with any null value, and of the remaining DataFrame, I’ll
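A more direct approach (a sketch; df stands for the DataFrame in question) is to aggregate with first(..., ignorenulls=True), which yields one row holding the first non-null value seen in each column:

```python
from pyspark.sql import functions as F

# Note: without an explicit sort, "first" means first in whatever order
# Spark happens to scan the rows.
first_non_null = df.select(
    [F.first(c, ignorenulls=True).alias(c) for c in df.columns]
)
first_non_null.show()
```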
Rename nested field in spark dataframe
Having a dataframe df in Spark: How to rename the field array_field.a to array_field.a_renamed? [Update]: .withColumnRenamed() does not work with nested fields, so I tried this hacky and unsafe method: I know that setting a private attribute is not good practice, but I don’t know any other way to set the schema for df. I think I am on a right
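One alternative to patching private attributes (a sketch; the element types long and double are assumptions and must match the real schema) is to cast the array of structs to the same structure under a new field name:

```python
from pyspark.sql import functions as F

# Casting an array<struct<...>> to a struct type with the same field types
# but different field names effectively renames the nested fields.
df_renamed = df.withColumn(
    "array_field",
    F.col("array_field").cast("array<struct<a_renamed:long,b:double>>")
)
df_renamed.printSchema()
```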
How to run Spark code in Airflow?
Hello people of the Earth! I’m using Airflow to schedule and run Spark tasks. All I have found so far is Python DAGs that Airflow can manage. DAG example: The problem is that I’m not good at Python and have some tasks written in Java. My question is: how do I run a Spark Java jar in a Python DAG? Or maybe there
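The DAG itself still has to be Python, but it can simply hand the jar to spark-submit. A sketch using SparkSubmitOperator (the connection id, jar path, main class and DAG settings are placeholders; older Airflow versions import the operator from airflow.contrib instead of the provider package):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_java_job",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
) as dag:
    submit_jar = SparkSubmitOperator(
        task_id="submit_spark_jar",
        conn_id="spark_default",            # Spark connection configured in Airflow
        application="/path/to/your-job.jar",
        java_class="com.example.YourMainClass",
    )
```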
How to select last row and also how to access PySpark dataframe by index?
From a PySpark SQL dataframe like: How do I get the last row? (With df.limit(1) I can get the first row of the dataframe into a new dataframe.) And how can I access dataframe rows by index, like row no. 12 or 200? In pandas I can do: I am just curious how to access a PySpark dataframe in such ways or alternative ways.
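A sketch of both operations (some_col is a placeholder for whichever column defines the ordering, since a Spark DataFrame has no inherent row order):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# "Last" row: sort descending and take one row.
last_row = df.orderBy(F.col("some_col").desc()).limit(1)

# Index-style access: attach a row number, then filter on it.
# Note: a window without partitionBy moves everything to a single partition.
w = Window.orderBy("some_col")
indexed = df.withColumn("row_id", F.row_number().over(w))
row_12 = indexed.filter(F.col("row_id") == 12)
```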
Retrieve top n in each group of a DataFrame in pyspark
There’s a DataFrame in pyspark with data as below: What I expect is 2 records returned for each group with the same user_id, namely those with the highest score. Consequently, the result should look as follows: I’m really new to pyspark; could anyone give me a code snippet or a pointer to the related documentation for this problem? Great
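A standard way to do this (a sketch assuming the columns are named user_id and score, as described above) is a window ranked by score within each user_id:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("user_id").orderBy(F.col("score").desc())

top2_per_user = (
    df.withColumn("rank", F.row_number().over(w))
      .filter(F.col("rank") <= 2)        # keep the 2 highest-scoring rows per user
      .drop("rank")
)
```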
PySpark: compare two rows in a dataframe
I’m attempting to compare one row in a dataframe with the next to see the difference in timestamp. Currently the data looks like: I’ve tried mapping a function onto the dataframe to allow for comparing like this: (note: I’m trying to get rows with a difference greater than 4 hours) But I’m getting the following error: Which I believe is
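Rather than mapping over pairs of rows, a window with lag() compares each row to the previous one. A sketch (the column names user_id and ts are assumptions about the data shown):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Bring the previous row's timestamp onto the current row, then take the difference.
w = Window.partitionBy("user_id").orderBy("ts")

with_gap = df.withColumn(
    "diff_seconds",
    F.col("ts").cast("long") - F.lag("ts").over(w).cast("long")
)

# Rows more than 4 hours after the previous one (the first row per user has a null gap).
over_four_hours = with_gap.filter(F.col("diff_seconds") > 4 * 3600)
```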