I'm using PySpark to do collaborative filtering with ALS. My original user and item IDs are strings, so I used StringIndexer to convert them to numeric indices (PySpark's ALS model requires numeric IDs). After I've fitted the model, I can get the top 3 recommendations for each user like so: The recs dataframe looks like so: I want
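A minimal sketch of this setup, assuming a SparkSession named spark and a ratings DataFrame with hypothetical string columns user and item plus a numeric rating column; the join at the end is one way to map the indexed recommendations back to the original string IDs.

from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS

# Index the string IDs (ALS needs integer-valued user/item columns).
user_indexer = StringIndexer(inputCol="user", outputCol="user_idx").fit(ratings)
item_indexer = StringIndexer(inputCol="item", outputCol="item_idx").fit(ratings)
indexed = item_indexer.transform(user_indexer.transform(ratings))

als = ALS(userCol="user_idx", itemCol="item_idx", ratingCol="rating")
model = als.fit(indexed)

# Top 3 recommendations per user, keyed by the numeric index.
recs = model.recommendForAllUsers(3)

# Map the numeric user index back to the original string ID via the indexer labels.
user_labels = spark.createDataFrame(
    [(i, label) for i, label in enumerate(user_indexer.labels)],
    ["user_idx", "user"],
)
recs_with_ids = recs.join(user_labels, on="user_idx")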
Tag: pyspark
How to start a standalone cluster using pyspark?
I am using PySpark under Ubuntu with Python 2.7. I installed it using: While trying to follow the instructions to set up a Spark cluster, I can't find the script start-master.sh. I assume this has to do with the fact that I installed pyspark and not regular Spark. I found here that I can connect a worker node to the master
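In a full Spark distribution the standalone-cluster scripts (start-master.sh and start-slave.sh, renamed start-worker.sh in newer releases) sit under $SPARK_HOME/sbin, and a plain pip install of pyspark may not put them on your PATH. Once a master is running somewhere, a minimal sketch for pointing a PySpark session at it (the master URL below is a placeholder):

from pyspark.sql import SparkSession

# Replace spark://master-host:7077 with the URL printed by the master
# (7077 is the default port of a standalone master).
spark = SparkSession.builder \
    .master("spark://master-host:7077") \
    .appName("standalone-cluster-check") \
    .getOrCreate()

print(spark.sparkContext.master)
spark.stop()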
Selecting only numeric/string columns names from a Spark DF in pyspark
I have a Spark DataFrame in PySpark (2.1.0) and I am looking to get the names of the numeric columns only, or of the string columns only. For example, this is the schema of my DF: This is what I need: How can I do that? Answer: dtypes is a list of (columnName, type) tuples, so you can use a simple filter
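A sketch of the dtypes-based filter described in the answer, assuming the DataFrame is called df; the exact type strings depend on your schema (decimals appear as e.g. decimal(10,2), hence the split):

# df.dtypes is a list of (columnName, typeName) tuples, e.g. [("age", "int"), ("name", "string")]
numeric_types = {"tinyint", "smallint", "int", "bigint", "float", "double", "decimal"}

numeric_cols = [name for name, dtype in df.dtypes if dtype.split("(")[0] in numeric_types]
string_cols = [name for name, dtype in df.dtypes if dtype == "string"]

print(numeric_cols)
print(string_cols)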
PySpark: Get first Non-null value of each column in dataframe
I'm dealing with different Spark DataFrames, which have a lot of null values in many columns. I want to get any one non-null value from each column to see if that value can be converted to datetime. I tried df.na.drop().first() in the hope that it would drop all rows with any null value, and of the remaining DataFrame, I'll
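One way to get a non-null sample per column (not necessarily what the original answer did) is a single aggregation with first(..., ignorenulls=True), sketched here for a DataFrame df:

from pyspark.sql import functions as F

# first(..., ignorenulls=True) returns the first non-null value the aggregation sees;
# without an explicit ordering the choice is arbitrary, but any sample is enough to
# test whether the values can be parsed as datetimes.
sample_values = df.agg(
    *[F.first(F.col(c), ignorenulls=True).alias(c) for c in df.columns]
).collect()[0].asDict()

print(sample_values)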
Rename nested field in spark dataframe
Having a dataframe df in Spark: How to rename the field array_field.a to array_field.a_renamed? [Update]: .withColumnRenamed() does not work with nested fields, so I tried this hacky and unsafe method: I know that setting a private attribute is not good practice, but I don't know another way to set the schema for df. I think I am on a right
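The private-attribute hack can usually be avoided. On Spark 2.4+ one alternative sketch (not the method from the post) rebuilds each array element with the new field name, assuming array_field is an array of structs with fields a and b:

from pyspark.sql import functions as F

# transform() maps over the array; named_struct() rebuilds each element with
# field a renamed to a_renamed while keeping b unchanged.
df_renamed = df.withColumn(
    "array_field",
    F.expr("transform(array_field, x -> named_struct('a_renamed', x.a, 'b', x.b))"),
)
df_renamed.printSchema()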
How to select last row and also how to access PySpark dataframe by index?
From a PySpark SQL dataframe like: How do I get the last row? (Just as with df.limit(1) I can get the first row of the dataframe into a new dataframe.) And how can I access the dataframe rows by index, like row no. 12 or 200? In pandas I can do: I am just curious how to access a PySpark dataframe in such ways, or in alternative ways.
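A couple of sketches for both parts of the question, assuming the DataFrame is df and that it has a sortable column (here a hypothetical id); note that Spark DataFrames have no intrinsic row order, so both "last" and "row number 12" only make sense relative to an explicit ordering:

from pyspark.sql import functions as F

# "Last" row with respect to an explicit ordering column.
last_row = df.orderBy(F.col("id").desc()).limit(1)

# Positional access: attach an index, then filter on it.
indexed = df.rdd.zipWithIndex().toDF(["row", "index"])
row_12 = indexed.filter(F.col("index") == 12).select("row.*")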
Pyspark: display a spark data frame in a table format
I am using pyspark to read a parquet file like below: Then when I do my_df.take(5), it shows [Row(…)] instead of a table format like when we use a pandas data frame. Is it possible to display the data frame in a table format like a pandas data frame? Thanks! Answer: The show method does what you're looking for. For
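A short sketch of the show-based approach, using my_df from the post (the toPandas line is optional and only sensible for a small sample, since it pulls the data to the driver):

# show() prints an ASCII table to stdout instead of returning Row objects.
my_df.show(5)

# truncate=False keeps long cell values intact.
my_df.show(5, truncate=False)

# In a notebook, a small sample can also be rendered via pandas.
my_df.limit(5).toPandas()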
Retrieve top n in each group of a DataFrame in pyspark
There's a DataFrame in pyspark with data as below: What I expect is to return 2 records in each group with the same user_id, which need to have the highest score. Consequently, the result should look like the following: I'm really new to pyspark; could anyone give me a code snippet or a pointer to the related documentation for this problem? Great
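One common way to do this is a window function: rank rows within each user_id by descending score and keep the top 2. A sketch, assuming the DataFrame is df with columns user_id and score as in the question:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("user_id").orderBy(F.col("score").desc())

top2 = (
    df.withColumn("rank", F.row_number().over(w))
      .filter(F.col("rank") <= 2)
      .drop("rank")
)
top2.show()

row_number() keeps exactly 2 rows per group; rank() would keep ties as well, which may or may not be what you want.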
pyspark, Compare two rows in dataframe
I'm attempting to compare one row in a dataframe with the next to see the difference in their timestamps. Currently the data looks like: I've tried mapping a function onto the dataframe to allow for comparisons like this: (note: I'm trying to get rows with a difference greater than 4 hours) But I'm getting the following error: Which I believe is
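A window-function sketch of the same idea, which avoids mapping a Python function over rows; the column names id and ts are hypothetical placeholders for the grouping key and the timestamp column:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("id").orderBy("ts")

# Cast timestamps to epoch seconds and compare each row with the previous one.
with_diff = df.withColumn(
    "diff_seconds",
    F.col("ts").cast("long") - F.lag("ts").over(w).cast("long"),
)

# Rows whose gap to the previous row exceeds 4 hours.
gaps = with_diff.filter(F.col("diff_seconds") > 4 * 3600)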
Filter Pyspark dataframe column with None value
I'm trying to filter a PySpark dataframe that has None as a row value: and I can filter correctly with a string value: but this fails: But there are definitely values in each category. What's going on? Answer: You can use Column.isNull / Column.isNotNull: If you want to simply drop NULL values you can use na.drop with the subset argument: Equality
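A sketch of the isNull / isNotNull and na.drop approaches from the answer, with a hypothetical column name category (the equality-style filter returns nothing because comparing a column to None with == evaluates to NULL for every row; SQL NULL semantics require IS NULL):

from pyspark.sql import functions as F

# Keep only rows where the column is NULL, or only rows where it is not.
df.filter(F.col("category").isNull()).show()
df.filter(F.col("category").isNotNull()).show()

# Drop rows that have NULL in that specific column only.
df.na.drop(subset=["category"]).show()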