I’m trying to collect groups of rows into sliding windows represented as vectors. Given the example input: An expected output would be: My latest attempt produces tumbling windows without padding. Here’s my code: I tried looking for variations of this, maybe by performing a SQL query like in this case or with some built-in SQL function such as ROWS N
Tag: apache-spark
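For the sliding-window question above, a common route is collect_list over a row-frame window. Below is a minimal sketch of that idea; since the question's input, expected output, and code are not shown here, the column names (id, value) and the window size of 3 are assumptions.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data; the real schema from the question is not shown, so these names are assumed.
df = spark.createDataFrame(
    [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0), (5, 50.0)],
    ["id", "value"],
)

window_size = 3
# Each row sees itself plus the next (window_size - 1) rows, ordered by id.
w = Window.orderBy("id").rowsBetween(0, window_size - 1)

result = df.withColumn("window", F.collect_list("value").over(w))
result.show(truncate=False)
# Rows near the end collect fewer values; if fixed-length vectors are needed,
# they could be padded afterwards, e.g. with concat and array_repeat.
```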
Get the most consecutive days from a Date column with PySpark
Original dataframe:

member_id  AccessDate
111111     2020-02-03
111111     2022-03-05
222222     2015-03-04
333333     2021-11-23
333333     2021-11-24
333333     2021-11-25
333333     2022-10-11
333333     2022-10-12
333333     2022-10-13
333333     2022-07-07
444444     2019-01-21
444444     2019-04-21
444444     2019-04-22
444444     2019-04-23
444444     2019-04-24
444444     2019-05-05
444444     2019-05-06
444444     2019-05-07

Result dataframe:

member_id  Most_Consecutive_AccessDate                     total
111111     2022-03-05                                      1
222222     2015-03-04                                      1
333333     2022-10-11, 2022-10-12, 2022-10-13              3
444444     2019-04-21, 2019-04-22, 2019-04-23, 2019-04-24  4
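A sketch of the usual approach for this kind of streak problem: give consecutive days the same group key by subtracting each row's row number from its day number, then keep the largest group per member. The comma-joined output mirrors the result table above; the tie-breaking rule (prefer the most recent streak) is an assumption, since the question doesn't spell it out.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("333333", "2022-10-11"), ("333333", "2022-10-12"), ("333333", "2022-10-13"),
     ("333333", "2022-07-07"), ("111111", "2020-02-03"), ("111111", "2022-03-05")],
    ["member_id", "AccessDate"],
)

w = Window.partitionBy("member_id").orderBy("AccessDate")

# Consecutive days share the same group key: day number minus row number.
grouped = df.withColumn(
    "grp",
    F.datediff("AccessDate", F.lit("1970-01-01")) - F.row_number().over(w),
)

streaks = grouped.groupBy("member_id", "grp").agg(
    F.sort_array(F.collect_list("AccessDate")).alias("dates"),
    F.count("*").alias("total"),
)

# Keep the longest streak per member; on ties, prefer the most recent one.
best = Window.partitionBy("member_id").orderBy(F.desc("total"), F.desc("grp"))
result = (
    streaks.withColumn("rn", F.row_number().over(best))
    .filter("rn = 1")
    .select("member_id",
            F.concat_ws(", ", "dates").alias("Most_Consecutive_AccessDate"),
            "total")
)
result.show(truncate=False)
```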
Join two rows iteratively to create a new table in Spark, with one row in the new table for each two rows
I have a table where I want to go through it in ranges of two rows. How do I create the table below, which goes in ranges of two and shows the first row's id together with the second row's col b and message, in Spark? The final table will look like this. Answer: In PySpark you can use Window; for example: Output:
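A rough illustration of the Window/lead approach the answer mentions; the column names id, col_b and message are guesses based on the question's wording, since the actual schema is not shown here.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "b1", "m1"), (2, "b2", "m2"), (3, "b3", "m3"), (4, "b4", "m4")],
    ["id", "col_b", "message"],
)

w = Window.orderBy("id")

paired = (
    df.withColumn("rn", F.row_number().over(w))
      # Pull col_b and message from the following row.
      .withColumn("next_col_b", F.lead("col_b").over(w))
      .withColumn("next_message", F.lead("message").over(w))
      # Keep the first row of each pair (rows 1, 3, 5, ...).
      .filter(F.col("rn") % 2 == 1)
      .select("id",
              F.col("next_col_b").alias("col_b"),
              F.col("next_message").alias("message"))
)
paired.show()
```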
PySpark: Performing One-Hot-Encoding
I need to perform a classification task on a dataset that consists of categorical variables. I performed one-hot encoding on that data, but I am confused about whether I am doing it the right way or not. Step 1: Let's say, for example, this is a dataset: Step 2: After performing one-hot encoding it gives this data: Step 3: Here the fourth
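For reference, the usual PySpark pipeline for this is StringIndexer followed by OneHotEncoder; the column name category and the sample values below are illustrative assumptions, not the asker's data.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

spark = SparkSession.builder.getOrCreate()

# Illustrative data; the asker's actual dataset is not shown here.
df = spark.createDataFrame(
    [("red",), ("green",), ("blue",), ("green",)], ["category"]
)

indexer = StringIndexer(inputCol="category", outputCol="category_idx")
encoder = OneHotEncoder(inputCols=["category_idx"], outputCols=["category_vec"])

encoded = Pipeline(stages=[indexer, encoder]).fit(df).transform(df)
encoded.show(truncate=False)
# By default dropLast=True, so the vector has (number of categories - 1) slots
# and the last category is encoded as the all-zeros vector.
```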
Replicate a function from pandas in PySpark
I am trying to execute the same function on a Spark DataFrame rather than a pandas one. Answer: A direct translation would require you to do multiple collects, one for each column calculation. I suggest you do all the column calculations in the DataFrame as a single row and then collect that row. Here's an example: calculate the percentage of whitespace values and the number
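A sketch of the single-row idea the answer describes: compute the statistic for every column inside one select, then collect just that one row. The whitespace test (trim(col) == '') and the column names below are illustrative assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("a", " "), ("", "b"), ("c", "d")], ["col1", "col2"])

# One wide row: percentage of whitespace/empty values per column, computed in a single pass.
stats_row = df.select(
    [
        (F.sum(F.when(F.trim(F.col(c)) == "", 1).otherwise(0)) * 100.0 / F.count("*"))
        .alias(f"{c}_pct_whitespace")
        for c in df.columns
    ]
).collect()[0]

print(stats_row.asDict())
```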
Replace Spark array values with values from a Python dictionary
I have a Spark DataFrame column containing array values: I want to replace [0, 1, 2, 3, 4] with [negative, positive, name, sequel, odd]. Answer:
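One way to do this (sketched below, assuming Spark 3.1+ for the Python transform function) is to turn the Python dict into a map literal with create_map and look each array element up inside transform; the column name tags is assumed.

```python
from itertools import chain
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

mapping = {0: "negative", 1: "positive", 2: "name", 3: "sequel", 4: "odd"}

# Assumed schema: an array<int> column called "tags".
df = spark.createDataFrame([([0, 1, 2],), ([3, 4],)], "tags array<int>")

# Build a literal map column from the Python dict.
mapping_col = F.create_map(*[F.lit(x) for x in chain(*mapping.items())])

result = df.withColumn(
    "tags", F.transform("tags", lambda x: F.element_at(mapping_col, x))
)
result.show(truncate=False)
```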
PySpark: sum all the values of a Map column into a new column
I have a DataFrame which looks like this. I want to sum all of the row-wise decimal values and store the result in a new column. My approach: This is not working; it says it can be applied only to int. Answer: Since your values are of float type, the initial value passed within the aggregate should match the type
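A minimal sketch of the point the answer makes: when summing map values with aggregate, the initial value has to match the value type, so for double values it should be lit(0.0) rather than lit(0). The column names scores and total are assumptions, and the Python aggregate function requires Spark 3.1+.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed schema: a map<string, double> column called "scores".
df = spark.createDataFrame([({"a": 1.5, "b": 2.5},), ({"x": 0.5},)], ["scores"])

result = df.withColumn(
    "total",
    F.aggregate(
        F.map_values("scores"),   # array<double> of the map's values
        F.lit(0.0),               # initial value: a double, not an int
        lambda acc, x: acc + x,
    ),
)
result.show(truncate=False)
```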
I am getting an error while defining H2OContext in a Python Spark script
Code: I am using a Spark standalone cluster (3.2.1) and trying to initiate H2OContext in a Python file. While trying to run the script using spark-submit, I am getting the following error: Spark-submit command: spark-submit --master spark://local:7077 --packages ai.h2o:sparkling-water-package_2.12:3.36.1.3-1-3.2 spark_h20/h2o.py Answer: The parameter --packages ai.h2o:sparkling-water-package_2.12:3.36.1.3-1-3.2 downloads a jar artifact from Maven. This artifact can be used only from Scala/Java. I see there is
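A hedged sketch of the PySparkling route the answer is heading toward: install the Python package for your Spark line (for Spark 3.2 that is pip install h2o_pysparkling_3.2) instead of relying only on the Scala/Java Maven artifact, then start the context from Python roughly like this.

```python
from pyspark.sql import SparkSession
from pysparkling import H2OContext  # provided by the h2o_pysparkling_3.2 pip package

spark = SparkSession.builder.appName("h2o-example").getOrCreate()

# Starts (or attaches to) an H2O cluster on top of the running Spark application.
hc = H2OContext.getOrCreate()
print(hc)
```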
PySpark - Cumulative sum with limits
I have a DataFrame as follows: The goal is to calculate a score for each user_id using valor as the base; it starts from 3 and increases or decreases by 1 as it goes through the valor column. The main problem here is that my score can't be under 1 and can't be over 5, so the sum must always
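Because each step depends on the clamped previous value, this can't be expressed as a plain window sum; a common workaround is a per-user pass with applyInPandas, sketched below. The column names (user_id, step, valor) and the ordering column are assumptions.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("u1", 1, 1), ("u1", 2, 1), ("u1", 3, 1), ("u1", 4, -1), ("u1", 5, -1)],
    ["user_id", "step", "valor"],
)

def clamped_score(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf = pdf.sort_values("step")
    score, scores = 3, []
    for v in pdf["valor"]:
        score = min(5, max(1, score + v))  # never drop below 1 or rise above 5
        scores.append(score)
    pdf["score"] = scores
    return pdf

result = df.groupBy("user_id").applyInPandas(
    clamped_score,
    schema="user_id string, step long, valor long, score long",
)
result.show()
```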
How to write this pandas logic for a pyspark.sql.dataframe.DataFrame without using the pandas-on-Spark API?
I'm totally new to PySpark. Since PySpark doesn't have a loc feature, how can we write this logic? I tried specifying conditions but couldn't get the desired result; any help would be greatly appreciated! Answer: For data like the following: You're actually updating the total column in each statement, not in an if-then-else way. Your code can be replicated (as
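A sketch of that replication: each pandas .loc assignment becomes its own withColumn with when/otherwise, keeping the sequential-update semantics rather than a single if-then-else chain. The flag/total columns and conditions below are illustrative, since the asker's actual code isn't shown.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative data; the real columns/conditions from the question are not shown.
df = spark.createDataFrame([("A", 10), ("B", 20), ("C", 30)], ["flag", "total"])

result = (
    df
    # pandas: df.loc[df["flag"] == "A", "total"] = 0
    .withColumn("total",
                F.when(F.col("flag") == "A", F.lit(0)).otherwise(F.col("total")))
    # pandas: df.loc[df["flag"] == "B", "total"] = df["total"] * 2
    .withColumn("total",
                F.when(F.col("flag") == "B", F.col("total") * 2).otherwise(F.col("total")))
)
result.show()
```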