Tag: pyspark

Pyspark create sliding windows from rows with padding

apache-spark apache-spark-sql pyspark python sliding-window

I’m trying to collect groups of rows into sliding windows represented as vectors. Given the example input: An expected output would be: My latest attempt produces tumbling windows without padding. Here’s my code: I tried looking for variations of this, maybe by performing a SQL query like in this case or with some built-in SQL function such as ROWS N

Get the most consecutive day from Date column with PySpark

apache-spark apache-spark-sql pyspark python

Original dataframe:

member_id  AccessDate
111111     2020-02-03
111111     2022-03-05
222222     2015-03-04
333333     2021-11-23
333333     2021-11-24
333333     2021-11-25
333333     2022-10-11
333333     2022-10-12
333333     2022-10-13
333333     2022-07-07
444444     2019-01-21
444444     2019-04-21
444444     2019-04-22
444444     2019-04-23
444444     2019-04-24
444444     2019-05-05
444444     2019-05-06
444444     2019-05-07

Result dataframe:

member_id  Most_Consecutive_AccessDate            total
111111     2022-03-05                             1
222222     2015-03-04                             1
333333     2022-10-11, 2022-10-12, 2022-10-13     3
444444     2019-04-21, 2019-04-22, 2019-04-23,
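The grouping logic behind this question can be sketched in plain Python before porting it to Spark. This is only an illustration of the per-member logic, not a PySpark solution: the function name `longest_consecutive_run` is hypothetical, and the tie-breaking rule (prefer the latest run when two runs have the same length) is an assumption inferred from the result rows for members 111111 and 333333.

```python
from datetime import date, timedelta


def longest_consecutive_run(dates):
    """Return the longest run of consecutive calendar days.

    Assumption: when several runs share the maximum length,
    the most recent run wins (inferred from the sample result).
    """
    ordered = sorted(set(dates))
    best, current = [], []
    for d in ordered:
        # Extend the current run if this day directly follows the previous one.
        if current and d - current[-1] == timedelta(days=1):
            current.append(d)
        else:
            current = [d]
        # ">=" lets a later run of equal length replace an earlier one.
        if len(current) >= len(best):
            best = list(current)
    return best


# Access dates for member 333333, taken from the sample data.
access_dates = [
    date(2021, 11, 23), date(2021, 11, 24), date(2021, 11, 25),
    date(2022, 10, 11), date(2022, 10, 12), date(2022, 10, 13),
    date(2022, 7, 7),
]
run = longest_consecutive_run(access_dates)
print(", ".join(d.isoformat() for d in run), len(run))
# prints: 2022-10-11, 2022-10-12, 2022-10-13 3
```

In Spark, the same idea is usually expressed per `member_id` with window functions rather than a Python loop, so treat this as a reference for checking the expected output.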

AWS Glue Job upsert from one db table to another db table

amazon-web-services aws-glue pyspark python sql

I am trying to create a pretty basic Glue job. I have two different AWS RDS MariaDB instances, with two similar tables (field names are different). I would like to transform the data from table A so it fits the table B schema (this seems pretty trivial and is working). And then I would like to update all existing entries (on

Join two rows iteratively to create a new table in Spark with one row for every two rows

apache-spark dataframe pyspark python

I have a table that I want to walk through two rows at a time. How do I create the table below, which steps in ranges of two and shows the first id with the second col b and message, in Spark? Final table will look like this. Answer In pyspark you can use Window, example Output:

Extract first fields from struct columns into a dictionary

dictionary field pyspark python struct

I need to create a dictionary from Spark dataframe’s schema of type pyspark.sql.types.StructType. The code needs to go through entire StructType, find only those StructField elements which are of type StructType and, when extracting into dictionary, use the name of parent StructField as key while value would be name of only the first nested/child StructField. Example schema (StructType): Desired result:
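One way to sketch this without a running Spark session is to work on the JSON form of the schema, which PySpark exposes through `StructType.jsonValue()`. The function name `first_nested_field_names` and the sample schema below are hypothetical, and the sketch assumes only top-level fields need to be inspected; the question's own example schema is not shown in this excerpt.

```python
def first_nested_field_names(schema_json):
    """Map each struct-typed field name to the name of its first child.

    `schema_json` is assumed to be the dict returned by
    `df.schema.jsonValue()`: simple types appear as strings,
    nested structs as dicts with "type": "struct".
    """
    result = {}
    for field in schema_json["fields"]:
        field_type = field["type"]
        is_struct = isinstance(field_type, dict) and field_type.get("type") == "struct"
        if is_struct and field_type["fields"]:
            result[field["name"]] = field_type["fields"][0]["name"]
    return result


# Hypothetical schema in jsonValue() form, for illustration only.
sample_schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": True, "metadata": {}},
        {"name": "address", "nullable": True, "metadata": {}, "type": {
            "type": "struct",
            "fields": [
                {"name": "street", "type": "string", "nullable": True, "metadata": {}},
                {"name": "city", "type": "string", "nullable": True, "metadata": {}},
            ],
        }},
    ],
}
print(first_nested_field_names(sample_schema))
# prints: {'address': 'street'}
```

Working on the dict form keeps the logic easy to test; the same loop could be written directly against `StructField.dataType` if that is preferred.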

how to use multiple when conditions in pyspark for updating column values

azure-databricks azure-synapse dataframe pyspark python

I am looking for a way to use multiple when conditions to update a column's values in pyspark. I am trying to work out how a column can be updated when several conditions apply in Spark. I have one dataframe in which we have three columns DATE, Flag_values, salary: After this I have to update

PySpark: Performing One-Hot-Encoding

apache-spark pyspark python

I need to perform a classification task on a dataset which consists of categorical variables. I performed one-hot encoding on that data, but I am not sure whether I am doing it the right way. Step 1: Let's say, for example, this is a dataset: Step 2: After performing one-hot encoding it gives this data: Step 3: Here the fourth

Regexp_replace “,” with “.” for every other comma in Spark

pyspark python sql

I have a dataframe in which numbers use , instead of . as the decimal mark, and the numbers themselves are also separated by commas. I need to replace only the odd commas with dots. The dataframe is very big but as an example, I have this: I want this df: Answer You can split on all commas , and later you can use for-loop: with range(0, len(splitted_data),
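The split-and-loop idea from the answer can be sketched in plain Python as follows. The answer's code is cut off in this excerpt, so the step of 2 and the function name `odd_commas_to_dots` are assumptions based on the question's stated goal of replacing the first, third, fifth (and so on) comma with a dot.

```python
def odd_commas_to_dots(value: str) -> str:
    """Replace every odd-numbered comma (1st, 3rd, ...) with a dot."""
    parts = value.split(",")
    pairs = []
    # Walk the pieces two at a time (assumed step of 2): each pair is one number.
    for i in range(0, len(parts), 2):
        pairs.append(".".join(parts[i:i + 2]))
    # The even-numbered commas remain as separators between numbers.
    return ",".join(pairs)


print(odd_commas_to_dots("1,5,2,3"))
# prints: 1.5,2.3
```

To apply this to a Spark column, a function like this would have to be wrapped in a UDF, which is typically slower than a built-in expression on a very large dataframe.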

Replicate a function from pandas into pyspark

apache-spark pandas pyspark python

I am trying to execute the same function on a spark dataframe rather than pandas. Answer A direct translation would require a separate collect for each column calculation. I suggest you do all calculations for columns in the dataframe as a single row and then collect that row. Here’s an example. Calculate percentage of whitespace values and number

Replace Spark array values with values from python dictionary

apache-spark arrays pyspark python replace

I have a Spark dataframe column having array values: I want to replace [0,1,2,3,4] with [negative,positive,name,sequel,odd] Answer