I have a dataframe like below. I am trying to extract the next word after function or var. My code is here. Since it captures only one word, the final row returns only AWS and not Twitter, so I would like to capture all matches. My Spark version is less than 3, so I tried df.withColumn('output', f.expr("regexp_extract_all(js, '(func)\\s+(\\w+)|(var)\\s+(\\w+)', 4)")).show()
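regexp_extract_all only exists as a built-in SQL function from Spark 3.1, so the expr call above will not resolve on an older version. A minimal sketch of one common workaround is a Python UDF built on re.findall; the column name "js" is taken from the question, the rest is illustrative.

```python
import re
from pyspark.sql import functions as f, types as t

# regexp_extract_all is only available from Spark 3.1, so on an older
# version one workaround is a Python UDF around re.findall.
# The column name "js" comes from the question.
@f.udf(t.ArrayType(t.StringType()))
def extract_names(text):
    if text is None:
        return []
    # capture the word that follows "function" or "var"
    return re.findall(r"(?:function|var)\s+(\w+)", text)

df = df.withColumn("output", extract_names("js"))
df.show(truncate=False)
```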
Tag: pyspark
Correct Method to Delete Delta Lake Partition on AWS s3
I need to delete a Delta Lake partition along with the associated AWS S3 files, and then make sure AWS Athena reflects the change. This is because I need to rerun some code to re-populate the data. I tried this, and it completed with no errors, but the files on S3 still exist and Athena still shows the data
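A minimal sketch of the usual pattern, assuming a table path and partition column that stand in for the questioner's: a Delta DELETE only writes a new table version, so the old Parquet files stay on S3 until they are vacuumed, and Athena only sees the change once its (Glue) metadata or manifest is regenerated.

```python
from delta.tables import DeltaTable

# Path and partition predicate are placeholders.
path = "s3://my-bucket/my-delta-table"
dt = DeltaTable.forPath(spark, path)

# Logically delete the partition (old files remain until vacuumed).
dt.delete("partition_date = '2021-12-01'")

# Physically remove files no longer referenced by the current version.
# Retaining 0 hours requires disabling the safety check first.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
dt.vacuum(0)

# Athena reads its own metadata, so regenerate the manifest it uses.
dt.generate("symlink_format_manifest")
```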
PySpark create a json string by combining columns
I have a dataframe. I would like to perform a transformation that combines a set of columns and stuffs them into a JSON string. The columns to be combined are known ahead of time. The output should look something like below. Is there any suggested method to achieve this? Appreciate any help on this. Answer You can create a struct type
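Following the struct idea the answer excerpt hints at, a minimal sketch with to_json over struct; the column names are placeholders for the known set of columns.

```python
from pyspark.sql import functions as f

# Columns to combine are known ahead of time; names here are placeholders.
cols_to_combine = ["name", "age", "city"]

# Pack the columns into a struct and serialise it as a JSON string,
# keeping the rest of the row untouched.
df = df.withColumn("json_payload", f.to_json(f.struct(*cols_to_combine)))
```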
Column with column names for nulls in row
I want to add a new column “Null_Values” to a PySpark dataframe as below Answer or
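A sketch of one way to build such a column: for each column emit its name when the value is null, then join the names per row (concat_ws skips nulls).

```python
from pyspark.sql import functions as f

# For every column, emit its name when the value is null, then join the
# surviving names into one comma-separated string per row.
null_flags = [f.when(f.col(c).isNull(), f.lit(c)) for c in df.columns]
df = df.withColumn("Null_Values", f.concat_ws(",", *null_flags))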
pyspark create all possible combinations of column values of a dataframe
I want to get all the possible combinations of size 2 of a column in a pyspark dataframe. My pyspark dataframe looks like One way would be to collect the values and get them into a Python iterable (list, pandas df) and use itertools.combinations to generate all combinations. However, I want to avoid collecting the dataframe column to the driver since the
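A sketch that stays on the cluster: self-join the column against itself and keep only one ordering of each pair, which yields all size-2 combinations without collecting to the driver. The column name "value" is a placeholder.

```python
from pyspark.sql import functions as f

# Alias the column twice and keep a single ordering of each pair.
left = df.select(f.col("value").alias("a")).distinct()
right = df.select(f.col("value").alias("b")).distinct()
pairs = left.join(right, f.col("a") < f.col("b"))
```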
Logical with count in Pyspark
I’m new to Pyspark and I have a problem to solve. I have a dataframe with 4 columns: customer, PersonId, is_online_store and count:

customer   PersonId   is_online_store   count
afabd2d2   4          true              1
afabd2d2   8          true              2
afabd2d2   3          true              1
afabd2d2   2          false             1
afabd2d2   4          false             1

I need to create according to the following rules: If PersonId count(column)
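The rule text is cut off above, so this is only an assumption about what is needed: this kind of per-person logic is usually expressed with conditional aggregation, e.g. counting online versus offline occurrences per customer and PersonId and then applying the rule to those counts.

```python
from pyspark.sql import functions as f

# Illustrative conditional aggregation; the actual rule is truncated in
# the question, so the derived columns here are assumptions.
agg = (
    df.groupBy("customer", "PersonId")
      .agg(
          f.sum(f.when(f.col("is_online_store"), f.col("count")).otherwise(0)).alias("online_count"),
          f.sum(f.when(~f.col("is_online_store"), f.col("count")).otherwise(0)).alias("offline_count"),
      )
)
```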
Calculate difference between date column entries and date minimum Pyspark
I feel like this is a stupid question, but I cannot seem to figure it out, so here goes. I have a PySpark data frame and one of the columns consists of dates. I want to compute the difference between each date in this column and the minimum date in the column, for the purpose of filtering to the past
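A minimal sketch of one approach: take the minimum date over the whole frame with an un-partitioned window and subtract it with datediff. The column name "date" and the 30-day cutoff are placeholders.

```python
from pyspark.sql import functions as f
from pyspark.sql.window import Window

# An empty partitionBy puts all rows in one window, giving the global
# minimum date; datediff then yields the offset in days for each row.
w = Window.partitionBy()
df = df.withColumn("days_from_min", f.datediff(f.col("date"), f.min("date").over(w)))

# e.g. keep only rows within 30 days of the earliest date
recent = df.filter(f.col("days_from_min") <= 30)
```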
Import pipe delimited txt file into spark dataframe in databricks
I have a data file saved in .txt format which has a header row at the top and is pipe delimited. I am working in Databricks and need to create a Spark dataframe of this data, with all columns read in as StringType(), the headers defined by the first row, and the columns separated based on the pipe delimiter.
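A pipe-delimited text file can be read with Spark's CSV reader by changing the delimiter; with schema inference left off, every column comes back as StringType. The path below is a placeholder.

```python
# Header row supplies the column names; "|" replaces the comma; no
# inferSchema, so all columns are strings.
df = (
    spark.read
         .option("header", "true")
         .option("delimiter", "|")
         .csv("/mnt/data/my_file.txt")
)
```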
How to get conditional values into new column from several external lists or arrays
I have the following dataframe: To it I have to add an additional column new_col_cond that depends on the values of multiple external lists/arrays (I have also tried with dictionaries), for example: The new column depends on the value of ratio and selects from either array according to id as index. I have tried: with errors coming. I assume
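A sketch of one way to do this, assuming column names ratio and id from the question and placeholder list values: wrap each external Python list as an array of literals so it becomes a column expression, index it with the id column, and choose between the two arrays with when/otherwise.

```python
from pyspark.sql import functions as f

# External Python lists; values are placeholders.
arr_low = [0.1, 0.2, 0.3]
arr_high = [1.1, 1.2, 1.3]

# Turn each list into an array column of literals so it can be indexed
# by the "id" column (0-based bracket indexing on an array column).
low_col = f.array(*[f.lit(v) for v in arr_low])
high_col = f.array(*[f.lit(v) for v in arr_high])

df = df.withColumn(
    "new_col_cond",
    f.when(f.col("ratio") < 0.5, low_col[f.col("id")])
     .otherwise(high_col[f.col("id")]),
)
```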
How to pass a variable into a PySpark sequence to generate a time series?
I want to generate a time series, from 2021-12-01 to 2021-12-31, but I want to pass the values into the sequence function with variables. This is my code: I want the values 2021-12-01 and 2021-12-31 inside variables. Something like: And get this result: But instead I’m receiving: cannot resolve ‘eldia1’ given input columns: [MES, NEGOCIO]; Answer Easiest would be to
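The error suggests the Python variable names ended up inside the SQL string, where Spark resolves them as column names. A minimal sketch that interpolates the values before handing the expression to expr; the variable names eldia1/eldia2 come from the error message, the output column name is an assumption.

```python
from pyspark.sql import functions as f

# Start and end dates live in Python variables.
eldia1 = "2021-12-01"
eldia2 = "2021-12-31"

# Interpolate the Python values into the SQL string instead of writing
# the variable names, which Spark would otherwise look up as columns.
df = df.withColumn(
    "fecha",
    f.explode(f.expr(f"sequence(to_date('{eldia1}'), to_date('{eldia2}'), interval 1 day)")),
)
```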