I have a dataframe like below. I am trying to extract the next word after function or var. My code is here. Since it captures only one word, the final row returns only AWS and not Twitter, so I would like to capture all matches. My Spark version is less than 3, so I tried df.withColumn('output', f.expr("regexp_extract_all(js, '(func)\\s+(\\w+)|(var)\\s+(\\w+)', 4)")).show()
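regexp_extract_all only exists as a built-in SQL function from Spark 3.1, so the expr call above will not resolve on an older version. A minimal sketch of one common workaround is a Python UDF built on re.findall; the column name "js" is taken from the question, the rest is illustrative.

```python
import re
from pyspark.sql import functions as f, types as t

# regexp_extract_all is only available from Spark 3.1, so on an older
# version one workaround is a Python UDF around re.findall.
# The column name "js" comes from the question.
@f.udf(t.ArrayType(t.StringType()))
def extract_names(text):
    if text is None:
        return []
    # capture the word that follows "function" or "var"
    return re.findall(r"(?:function|var)\s+(\w+)", text)

df = df.withColumn("output", extract_names("js"))
df.show(truncate=False)
```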
Tag: pyspark
Correct Method to Delete Delta Lake Partition on AWS s3
I need to delete a Delta Lake partition along with the associated AWS S3 files, and then make sure AWS Athena reflects the change. This is because I need to rerun some code to re-populate the data. I tried this, and it completed with no errors, but the files on S3 still exist and Athena still shows the data
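A minimal sketch of the usual pattern, assuming a table path and partition column that stand in for the questioner's: a Delta DELETE only writes a new table version, so the old Parquet files stay on S3 until they are vacuumed, and Athena only sees the change once its (Glue) metadata or manifest is regenerated.

```python
from delta.tables import DeltaTable

# Path and partition predicate are placeholders.
path = "s3://my-bucket/my-delta-table"
dt = DeltaTable.forPath(spark, path)

# Logically delete the partition (old files remain until vacuumed).
dt.delete("partition_date = '2021-12-01'")

# Physically remove files no longer referenced by the current version.
# Retaining 0 hours requires disabling the safety check first.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
dt.vacuum(0)

# Athena reads its own metadata, so regenerate the manifest it uses.
dt.generate("symlink_format_manifest")
```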
PySpark create a json string by combining columns
I have a dataframe. I would like to perform a transformation that combines a set of columns and stuffs them into a JSON string. The columns to be combined are known ahead of time. The output should look something like below. Is there any suggested method to achieve this? Appreciate any help on this. Answer You can create a struct type
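Following the struct idea the answer excerpt hints at, a minimal sketch with to_json over struct; the column names are placeholders for the known set of columns.

```python
from pyspark.sql import functions as f

# Columns to combine are known ahead of time; names here are placeholders.
cols_to_combine = ["name", "age", "city"]

# Pack the columns into a struct and serialise it as a JSON string,
# keeping the rest of the row untouched.
df = df.withColumn("json_payload", f.to_json(f.struct(*cols_to_combine)))
```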
Column with column names for nulls in row
I want to add a new column “Null_Values” to a PySpark dataframe as below Answer or
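A sketch of one way to build such a column: for each column emit its name when the value is null, then join the names per row (concat_ws skips nulls).

```python
from pyspark.sql import functions as f

# For every column, emit its name when the value is null, then join the
# surviving names into one comma-separated string per row.
null_flags = [f.when(f.col(c).isNull(), f.lit(c)) for c in df.columns]
df = df.withColumn("Null_Values", f.concat_ws(",", *null_flags))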
pyspark create all possible combinations of column values of a dataframe
I want to get all the possible combinations of size 2 of a column in a pyspark dataframe. My pyspark dataframe looks like One way would be to collect the values and get them into a Python iterable (list, pandas df) and use itertools.combinations to generate all combinations. However, I want to avoid collecting the dataframe column to the driver since the
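A sketch that stays on the cluster: self-join the column against itself and keep only one ordering of each pair, which yields all size-2 combinations without collecting to the driver. The column name "value" is a placeholder.

```python
from pyspark.sql import functions as f

# Alias the column twice and keep a single ordering of each pair.
left = df.select(f.col("value").alias("a")).distinct()
right = df.select(f.col("value").alias("b")).distinct()
pairs = left.join(right, f.col("a") < f.col("b"))
```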
Logical with count in Pyspark
I’m new to Pyspark and I have a problem to solve. I have a dataframe with 4 columns: customer, PersonId, is_online_store and count:

customer   PersonId   is_online_store   count
afabd2d2   4          true              1
afabd2d2   8          true              2
afabd2d2   3          true              1
afabd2d2   2          false             1
afabd2d2   4          false             1

I need to create according to the following rules: If PersonId count(column)
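The rule text is cut off above, so this is only an assumption about what is needed: this kind of per-person logic is usually expressed with conditional aggregation, e.g. counting online versus offline occurrences per customer and PersonId and then applying the rule to those counts.

```python
from pyspark.sql import functions as f

# Illustrative conditional aggregation; the actual rule is truncated in
# the question, so the derived columns here are assumptions.
agg = (
    df.groupBy("customer", "PersonId")
      .agg(
          f.sum(f.when(f.col("is_online_store"), f.col("count")).otherwise(0)).alias("online_count"),
          f.sum(f.when(~f.col("is_online_store"), f.col("count")).otherwise(0)).alias("offline_count"),
      )
)
```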
Calculate difference between date column entries and date minimum Pyspark
I feel like this is a stupid question, but I cannot seem to figure it out, so here goes. I have a PySpark data frame and one of the columns consists of dates. I want to compute the difference between each date in this column and the minimum date in the column, for the purpose of filtering to the past
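A minimal sketch of one approach: take the minimum date over the whole frame with an un-partitioned window and subtract it with datediff. The column name "date" and the 30-day cutoff are placeholders.

```python
from pyspark.sql import functions as f
from pyspark.sql.window import Window

# An empty partitionBy puts all rows in one window, giving the global
# minimum date; datediff then yields the offset in days for each row.
w = Window.partitionBy()
df = df.withColumn("days_from_min", f.datediff(f.col("date"), f.min("date").over(w)))

# e.g. keep only rows within 30 days of the earliest date
recent = df.filter(f.col("days_from_min") <= 30)
```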
Import pipe delimited txt file into spark dataframe in databricks
I have a data file saved in .txt format which has a header row at the top and is pipe delimited. I am working in Databricks and need to create a Spark dataframe of this data, with all columns read in as StringType(), the headers defined by the first row, and the columns separated based on the pipe delimiter.
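A pipe-delimited text file can be read with Spark's CSV reader by changing the delimiter; with schema inference left off, every column comes back as StringType. The path below is a placeholder.

```python
# Header row supplies the column names; "|" replaces the comma; no
# inferSchema, so all columns are strings.
df = (
    spark.read
         .option("header", "true")
         .option("delimiter", "|")
         .csv("/mnt/data/my_file.txt")
)
```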
How to get conditional values into new column from several external lists or arrays
I have the following dataframe: To it I have to add an additional column new_col_cond that depends on the values of multiple external lists/arrays (I have also tried with dictionaries), for example: The new column depends on the value of ratio and selects from either array according to id as index. I have tried: with errors coming. I assume
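A sketch of one way to do this, assuming column names ratio and id from the question and placeholder list values: wrap each external Python list as an array of literals so it becomes a column expression, index it with the id column, and choose between the two arrays with when/otherwise.

```python
from pyspark.sql import functions as f

# External Python lists; values are placeholders.
arr_low = [0.1, 0.2, 0.3]
arr_high = [1.1, 1.2, 1.3]

# Turn each list into an array column of literals so it can be indexed
# by the "id" column (0-based bracket indexing on an array column).
low_col = f.array(*[f.lit(v) for v in arr_low])
high_col = f.array(*[f.lit(v) for v in arr_high])

df = df.withColumn(
    "new_col_cond",
    f.when(f.col("ratio") < 0.5, low_col[f.col("id")])
     .otherwise(high_col[f.col("id")]),
)
```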
How to pass a variable into a PySpark sequence to generate a time series?
I want to generate a time series, from 2021-12-01 to 2021-12-31, but I want to pass the values into the sequence function with variables. This is my code: I want the values 2021-12-01 and 2021-12-31 inside variables. Something like: And get this result: But instead I’m receiving: cannot resolve ‘eldia1’ given input columns: [MES, NEGOCIO]; Answer Easiest would be to
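The error suggests the Python variable names ended up inside the SQL string, where Spark resolves them as column names. A minimal sketch that interpolates the values before handing the expression to expr; the variable names eldia1/eldia2 come from the error message, the output column name is an assumption.

```python
from pyspark.sql import functions as f

# Start and end dates live in Python variables.
eldia1 = "2021-12-01"
eldia2 = "2021-12-31"

# Interpolate the Python values into the SQL string instead of writing
# the variable names, which Spark would otherwise look up as columns.
df = df.withColumn(
    "fecha",
    f.explode(f.expr(f"sequence(to_date('{eldia1}'), to_date('{eldia2}'), interval 1 day)")),
)
```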