Spread a List of Lists to a Spark DF with PySpark?

I'm currently struggling with the following issue. Take the following list of lists: [[1, 2, 3], [4, 5], [6, 7]]. How can I create the following Spark DF out of it, with one row per element of each sublist? …
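One way to get there (a sketch, not necessarily the asker's accepted answer): flatten the nested list driver-side into one-element tuples, which is the row shape `spark.createDataFrame` expects; the same expansion can also be done inside Spark with `F.explode` on an array column. The PySpark call is shown as a comment so the sketch stays runnable without a Spark install:

```python
from itertools import chain

def flatten_to_rows(list_of_lists):
    # One (value,) tuple per element of each sublist, i.e. one row per element
    return [(x,) for x in chain.from_iterable(list_of_lists)]

rows = flatten_to_rows([[1, 2, 3], [4, 5], [6, 7]])
# In PySpark (assumption about the wanted single-column schema):
#   df = spark.createDataFrame(rows, ["value"])
```
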

How can I generate the same UUID for multiple dataframes in spark?

I have a df that I read from a file: import uuid df = spark.read.csv(path, sep="|", header=True) Then I give it a UUID column: uuidUdf = udf(lambda: str(uuid.uuid4()), StringType()) df = df….
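The usual pitfall here is that a udf wrapping `uuid.uuid4()` is evaluated per row (and re-evaluated whenever Spark recomputes the plan), so every dataframe, and even every action, can see different values. A common fix, sketched below, is to generate the UUID once on the driver and attach it as a literal; the `F.lit` lines are assumptions about the asker's column names:

```python
import uuid

def make_run_id():
    # Generate the UUID once on the driver so it can be shared verbatim;
    # a per-row udf would produce a different value on each evaluation.
    return str(uuid.uuid4())

run_id = make_run_id()
# In PySpark (hypothetical dataframes), attach the same literal everywhere:
#   df1 = df1.withColumn("uuid", F.lit(run_id))
#   df2 = df2.withColumn("uuid", F.lit(run_id))
```
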

PySpark: write a function to count non-zero values of given columns

I want a function that takes column names and grouping conditions as input and, based on those, returns the count of non-zero values for each column. Something like …
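In PySpark the per-column trick is usually `F.count(F.when(F.col(c) != 0, c)).alias(c)` after a `groupBy`, since `count` skips the nulls that `when` produces for zeros. A driver-side sketch of the same counting logic (the function name and dict-row shape are illustrative, not from the question):

```python
from collections import defaultdict

def count_nonzero_by_group(rows, group_col, value_cols):
    """For each group, count the non-zero values in each requested column.
    Mirrors F.count(F.when(F.col(c) != 0, c)) per column in PySpark."""
    counts = defaultdict(lambda: {c: 0 for c in value_cols})
    for row in rows:
        for c in value_cols:
            if row[c] != 0:
                counts[row[group_col]][c] += 1
    return dict(counts)
```
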

Converting Python code to PySpark code

The code below is in Python and I want to convert it to PySpark; basically, I'm not sure what the code will be for the statement pd.read_sql(query, connect_to_hive) when converted to PySpark. Need to …
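Assuming the query targets Hive (which the connection name suggests but the excerpt does not confirm), the usual replacement is `spark.sql` on a Hive-enabled session; the mapping is shown as comments so the sketch does not require a running Spark install, and the table name is hypothetical:

```python
query = "SELECT * FROM sales_db.orders"  # hypothetical Hive table

# The pandas call being replaced:
#   df = pd.read_sql(query, connect_to_hive)
#
# The PySpark counterpart (assuming Hive support was enabled on the session):
#   spark = SparkSession.builder.enableHiveSupport().getOrCreate()
#   sdf = spark.sql(query)        # lazy Spark DataFrame
#   pdf = sdf.toPandas()          # only if a pandas frame is really needed
```
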

TypeError: 'GroupedData' object is not iterable in a PySpark dataframe

I have a Spark dataframe sdf with GPS points that looks like this: d = {'user': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'A', 'A'], 'lat': [37.75243634842733, 37….
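The error itself comes from trying to loop over the result of `sdf.groupBy(...)`: in PySpark, `GroupedData` must be followed by an aggregation (`agg`, `count`, `applyInPandas`, ...) and cannot be iterated. If per-group rows really are needed driver-side, one sketch (plain Python, dict rows assumed for illustration) is to collect and group locally:

```python
from collections import defaultdict

def group_rows(rows, key):
    """Driver-side analogue of groupBy: PySpark's GroupedData is not
    iterable, so to walk groups by hand you collect() first and bucket
    the rows yourself (fine only for small data)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)
```
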

How to do row-wise multiplication of two PySpark dataframes

I have the two PySpark dataframes df1 and df2 below:

df1:
product  04-01  04-02  04-03  04-05  04-06
cycle       12     24     25     17     39
bike        42     15      4     94     03
bycyle …
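The usual PySpark shape for this is a join on the key column followed by `(df1[c] * df2[c]).alias(c)` for each date column. A driver-side sketch of that join-and-multiply logic (function name and dict-row shape are illustrative):

```python
def rowwise_multiply(df1, df2, key, cols):
    """Match rows of two tables on `key` and multiply each shared column.
    Mirrors df1.join(df2, key).select([(df1[c] * df2[c]).alias(c) ...])
    in PySpark."""
    lookup = {r[key]: r for r in df2}
    out = []
    for r in df1:
        other = lookup.get(r[key])
        if other is None:
            continue  # inner-join semantics: drop unmatched keys
        out.append({key: r[key], **{c: r[c] * other[c] for c in cols}})
    return out
```
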

Convert a date-month-year-time string to date format in PySpark

I have a file with a timestamp column. When I try to read the file with a schema I designed myself, the datetime column is populated with null. The source file has data like this: created_date 31-AUG-2016 …
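Nulls here typically mean the string does not match the default date format, so the column has to be read as a string and parsed with an explicit pattern; in PySpark that is `F.to_date(F.col("created_date"), "dd-MMM-yyyy")` (and on Spark 3 the upper-case "AUG" may additionally require `spark.sql.legacy.timeParserPolicy=LEGACY`). The equivalent parse in plain Python, to check the pattern against a sample value:

```python
from datetime import datetime

def parse_created_date(s):
    # %d-%b-%Y matches "31-AUG-2016"; strptime's %b is case-insensitive
    # for month abbreviations under the default locale.
    return datetime.strptime(s, "%d-%b-%Y")
```
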

Parse a JSON string from a PySpark Dataframe

I have a nested JSON dict that I need to convert to a Spark dataframe. This JSON dict is present in a dataframe column. I have been trying to parse the dict present in the dataframe column using "…
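The columnar PySpark tool for this is `F.from_json(F.col("payload"), schema)`, which needs an explicit schema (or one inferred via `F.schema_of_json`); the column name here is an assumption. The per-cell parse it performs is just a JSON decode, shown in plain Python:

```python
import json

def parse_json_column(value):
    """Parse one cell of a JSON-string column into nested Python objects;
    F.from_json does the same per row, but into a struct column with a
    declared schema."""
    return json.loads(value)
```
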

Get tables from AWS Glue using boto3

I need to harvest table and column names from the AWS Glue crawler metadata catalogue. I used boto3 but I constantly get only 100 tables even though there are more. Setting NextToken doesn't …
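Glue's `get_tables` caps each response at 100 tables, so the `NextToken` from every response has to be fed back into the next request until it disappears (boto3 also ships `client.get_paginator("get_tables")` for exactly this). A sketch of the loop, exercised against a hypothetical stand-in client so it runs without AWS credentials:

```python
def fetch_all_tables(client, database):
    """Collect every table from glue.get_tables by re-sending NextToken
    until the response no longer includes one."""
    tables, token = [], None
    while True:
        kwargs = {"DatabaseName": database}
        if token:
            kwargs["NextToken"] = token
        resp = client.get_tables(**kwargs)
        tables.extend(resp["TableList"])
        token = resp.get("NextToken")
        if not token:
            return tables

class FakeGlueClient:
    """Hypothetical stand-in for boto3.client('glue'): two pages of results."""
    _pages = [
        {"TableList": [{"Name": "t1"}, {"Name": "t2"}], "NextToken": "p2"},
        {"TableList": [{"Name": "t3"}]},
    ]
    def get_tables(self, **kwargs):
        return self._pages[1] if kwargs.get("NextToken") == "p2" else self._pages[0]
```
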

Split a list of overlapping intervals into non-overlapping subintervals in a PySpark dataframe

I have a PySpark dataframe with columns start_time and end_time that define an interval per row. There is a column rate, and I want to know whether there are different values for a sub-…
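The standard sweep for this: collect every start and end point as a boundary, sort them, and cut each original interval at those boundaries. In PySpark the boundary set could be built with `explode` plus a window, but the core algorithm (shown driver-side, with half-open `[start, end)` intervals assumed) is:

```python
def split_intervals(intervals):
    """Split overlapping [start, end) intervals into the non-overlapping
    subintervals induced by every boundary point."""
    points = sorted({p for s, e in intervals for p in (s, e)})
    pieces = list(zip(points, points[1:]))          # candidate subintervals
    out = set()
    for s, e in intervals:
        for ps, pe in pieces:
            if s <= ps and pe <= e:                 # piece lies inside interval
                out.add((ps, pe))
    return sorted(out)
```

Once each row is cut into these pieces, checking whether rate differs within a subinterval becomes a plain group-by on the piece.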