I am trying to run a subquery inside a CASE statement in PySpark and it is throwing an exception. I am trying to create a new flag indicating whether an id in one table is present in a different table. Is this even possible in PySpark? Here is the error: I am using Spark 2.2.1. Answer This appears to be the latest
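Spark 2.2 generally rejects a subquery inside a CASE/when expression in the DataFrame API; the usual workaround is a left join followed by a presence flag. A minimal sketch of that workaround, with hypothetical table and column names:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical tables: df_a holds the rows to flag, df_b holds the lookup ids.
df_a = spark.createDataFrame([(1, "x"), (2, "y"), (3, "z")], ["id", "val"])
df_b = spark.createDataFrame([(1,), (3,)], ["id"])

# Left-join against the distinct lookup ids, then flag rows that found a match.
b_ids = df_b.select(F.col("id").alias("b_id")).distinct()
flagged = (
    df_a.join(b_ids, df_a["id"] == b_ids["b_id"], "left")
        .withColumn("flag", F.when(F.col("b_id").isNotNull(), 1).otherwise(0))
        .drop("b_id")
)
```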
Tag: apache-spark-sql
PySpark reversing StringIndexer in nested array
I’m using PySpark to do collaborative filtering with ALS. My original user and item ids are strings, so I used StringIndexer to convert them to numeric indices (PySpark’s ALS model requires this). After I’ve fitted the model, I can get the top 3 recommendations for each user like so: The recs dataframe looks like this: I want
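IndexToString only reverses a flat numeric column, so for the nested recommendations array one common approach is to look each index up in the fitted StringIndexerModel’s labels list from inside a UDF. A sketch, where item_indexer (the fitted StringIndexerModel) and the struct field names item_idx/rating are assumptions:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType, StringType, StructField, StructType

# labels[i] is the original string id that StringIndexer mapped to index i.
labels = item_indexer.labels  # item_indexer: the fitted StringIndexerModel (assumed)

rec_schema = ArrayType(StructType([
    StructField("item_id", StringType()),
    StructField("rating", FloatType()),
]))

@F.udf(rec_schema)
def decode_recs(rec_list):
    # Each element is a struct like (item_idx, rating); field names are assumed.
    return [(labels[int(r.item_idx)], float(r.rating)) for r in rec_list]

decoded = recs.withColumn("recommendations", decode_recs("recommendations"))
```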
Selecting only numeric/string column names from a Spark DF in pyspark
I have a Spark DataFrame in PySpark (2.1.0) and I am looking to get the names of the numeric columns only, or the string columns only. For example, this is the schema of my DF: This is what I need: How can I do this? Answer dtypes is a list of (columnName, type) tuples, so you can use a simple filter
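A sketch of that filter; note that decimal columns report their type as e.g. decimal(10,2), hence the prefix check:

```python
# df.dtypes is a list of (columnName, typeString) pairs.
string_cols = [name for name, dtype in df.dtypes if dtype == "string"]

numeric_prefixes = ("tinyint", "smallint", "int", "bigint", "float", "double", "decimal")
numeric_cols = [name for name, dtype in df.dtypes if dtype.startswith(numeric_prefixes)]
```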
PySpark: Get first non-null value of each column in dataframe
I’m dealing with different Spark DataFrames, which have a lot of null values in many columns. I want to get any one non-null value from each column, to see if that value can be converted to a datetime. I tried doing df.na.drop().first() in the hope that it would drop all rows with any null value, and that, of the remaining DataFrame, I’ll
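Rather than dropping rows, pyspark.sql.functions.first with ignorenulls=True collects one non-null value per column in a single pass; a minimal sketch:

```python
from pyspark.sql import functions as F

# One Row whose value for each column is that column's first non-null value
# (or None if the column is entirely null).
first_non_nulls = df.select(
    [F.first(c, ignorenulls=True).alias(c) for c in df.columns]
).first()
```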
How to select the last row, and how to access a PySpark dataframe by index?
From a PySpark SQL dataframe, how do I get the last row? (Just as with df.limit(1) I can get the first row of the dataframe into a new dataframe.) And how can I access the dataframe rows by index, like row no. 12 or 200? In pandas I can do I am just curious how to access a PySpark dataframe in such ways, or alternative ways.
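Spark rows carry no inherent order, so “last” and “row no. 12” are only meaningful relative to some ordering; with that caveat, zipWithIndex is the usual tool. A sketch:

```python
# Pair each Row with a 0-based index (sort the dataframe first if a
# particular ordering is what defines "row 12" and "last").
indexed = df.rdd.zipWithIndex()

row_12 = indexed.filter(lambda pair: pair[1] == 11).map(lambda pair: pair[0]).first()
last_row = indexed.max(key=lambda pair: pair[1])[0]
```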
PySpark: display a Spark data frame in a table format
I am using PySpark to read a parquet file like below: Then when I do my_df.take(5), it shows [Row(…)] instead of a table format like when we use a pandas data frame. Is it possible to display the data frame in a table format like a pandas data frame? Thanks! Answer The show method does what you’re looking for. For
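A sketch of both options:

```python
my_df.show(5)               # prints the first 5 rows as an ASCII table
my_df.limit(5).toPandas()   # or render a small slice through pandas in a notebook
```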
Retrieve top n in each group of a DataFrame in pyspark
There’s a DataFrame in pyspark with data as below: What I expect is for it to return 2 records in each group with the same user_id, the ones with the highest score. Consequently, the result should look as the following: I’m really new to pyspark; could anyone give me a code snippet or a pointer to the related documentation for this problem? Great
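The standard pattern is a window function: number the rows within each user_id by descending score and keep the first two. A sketch:

```python
from pyspark.sql import Window, functions as F

w = Window.partitionBy("user_id").orderBy(F.desc("score"))

top2 = (
    df.withColumn("rank", F.row_number().over(w))
      .filter(F.col("rank") <= 2)
      .drop("rank")
)
```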
PySpark: compare two rows in a dataframe
I’m attempting to compare each row in a dataframe with the next one, to see the difference in timestamps. Currently the data looks like: I’ve tried mapping a function onto the dataframe to allow comparing like this: (note: I’m trying to get rows with a difference greater than 4 hours) But I’m getting the following error: Which I believe is
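A window lag avoids the row-by-row mapping entirely: each row gets the previous row’s timestamp as a column. A sketch, assuming the timestamp column is named ts (hypothetical):

```python
from pyspark.sql import Window, functions as F

w = Window.orderBy("ts")  # add partitionBy(...) if rows should be compared per key

over_4h = (
    df.withColumn("prev_ts", F.lag("ts").over(w))
      .withColumn(
          "delta_hours",
          (F.col("ts").cast("long") - F.col("prev_ts").cast("long")) / 3600.0,
      )
      .filter(F.col("delta_hours") > 4)
)
```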
Converting Pandas dataframe into Spark dataframe error
I’m trying to convert a Pandas DF into a Spark one. DF head: Code: And I got an error: Answer You need to make sure your pandas dataframe columns are appropriate for the types Spark is inferring. If your pandas dataframe lists something like: And you’re getting that error, try: Now, make sure .astype(str) is actually the type you want those columns
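A sketch of that fix, with hypothetical names for the mixed-type columns:

```python
# Force the offending object-typed columns to a consistent string type
# before letting Spark infer the schema.
pdf[["col_a", "col_b"]] = pdf[["col_a", "col_b"]].astype(str)  # hypothetical names

sdf = spark.createDataFrame(pdf)
```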
How can I read in a binary file from hdfs into a Spark dataframe?
I am trying to port some code from pandas to (py)Spark. Unfortunately, I am already failing at the input part, where I want to read binary data and put it into a Spark DataFrame. So far I am using fromfile from numpy: But for Spark I couldn’t find out how to do it. My workaround so far was to use
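For fixed-length records, sc.binaryRecords plays roughly the role numpy.fromfile played locally; a sketch assuming each record holds two int64 values (the record layout, path, and column names are all assumptions):

```python
import numpy as np

# Each record is recordLength bytes; two int64 values = 16 bytes (assumed layout).
records = sc.binaryRecords("hdfs:///path/to/data.bin", recordLength=16)

# Decode each byte record into plain Python ints, then build a DataFrame.
rows = records.map(lambda rec: [int(v) for v in np.frombuffer(rec, dtype=np.int64)])
df = rows.toDF(["a", "b"])  # hypothetical column names
```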