I’m trying to run a transformation function in a PySpark script. My dataset looks like this: My desired output is something like this: However, the last code line gives me an error similar to this: When I check, I see ‘col1’, ‘col2’, etc. in the first row instead of the actual labels ([“Name”, ”Type”]). Should I separately remove and
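A minimal sketch of one way around that, assuming the file was read without a header so the real labels ended up in the data (the DataFrame name df, the file path, and reading it as CSV are placeholders or assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumption: the source file has a header row; header=True keeps "Name" and "Type"
# as column names instead of the generated col1, col2, ...
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Alternatively, rename the generated columns after reading
df = df.toDF("Name", "Type")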
Tag: pyspark
PySpark: write a function to count non-zero values of given columns
I want a function that takes column names and grouping conditions as input and, for each column, returns the count of non-zero values. Something like this, but with the non-zero condition included as well. Answer: You can use a list comprehension to generate the list of aggregation expressions:
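A minimal sketch of that idea; the function name, grouping columns, and value columns below are assumptions:

from pyspark.sql import functions as F

def count_nonzero(df, group_cols, value_cols):
    # One aggregation expression per column: count only rows where the value is non-zero
    aggs = [
        F.count(F.when(F.col(c) != 0, c)).alias(f"{c}_nonzero")
        for c in value_cols
    ]
    return df.groupBy(*group_cols).agg(*aggs)

# Hypothetical usage:
# result = count_nonzero(df, ["region"], ["sales", "returns"])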
How to fill in null values in PySpark
I have a df that will join a calendar date df. Next step: I am populating the date range between the first and last date. Step 2: let’s say this is the calendar df that has an id and calendar dates, and I want to join with the calendar dates. I would like to fill in all those null values based on the first non-null
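A common way to do this kind of fill is a window-based forward fill with last(..., ignorenulls=True); a sketch assuming the columns are named id, date, and value:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# For each id, ordered by date, carry forward the last non-null value seen so far
w = (Window.partitionBy("id")
     .orderBy("date")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))

filled = df.withColumn("value", F.last("value", ignorenulls=True).over(w))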
Converting Python code to PySpark code
The code below is in Python and I want to convert it to PySpark. Basically, I’m not sure what the code for the statement pd.read_sql(query, connect_to_hive) should be in PySpark. I need to extract data from the EDL, so I am making the connection to the EDL using pyodbc and then extracting the data using a SQL query. pyodbc connection to the
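A hedged sketch of the PySpark side: if the Hive tables are reachable from the Spark session’s metastore, spark.sql replaces pd.read_sql entirely and pyodbc is no longer needed; the query string and the JDBC details below are placeholders:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("edl-extract")
         .enableHiveSupport()   # assumes the cluster is configured with the Hive metastore
         .getOrCreate())

query = "SELECT * FROM some_db.some_table"   # placeholder query
df = spark.sql(query)

# If the EDL is only reachable over JDBC, a reader like this is the usual alternative
# (URL, driver and credentials are assumptions):
# df = (spark.read.format("jdbc")
#       .option("url", "jdbc:hive2://edl-host:10000/default")
#       .option("query", query)
#       .load())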
TypeError: ‘GroupedData’ object is not iterable in PySpark DataFrame
I have a Spark DataFrame sdf with GPS points that looks like this: Since the Spark DataFrame contains different GPS trajectories generated by different users on different days, I want to write a function that loops through this df and feeds the corresponding set of coordinates to the (OSRM) request per date and per user group, and not all at
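GroupedData itself cannot be iterated; a sketch of two workarounds, either collecting the distinct (user, date) keys and filtering per key, or handing each group to a function with applyInPandas. The column names user_id, date, lon, and lat are assumptions:

# Option 1: loop over the distinct keys, filtering the rows of each group
keys = sdf.select("user_id", "date").distinct().collect()
for row in keys:
    group_df = sdf.filter((sdf.user_id == row["user_id"]) & (sdf.date == row["date"]))
    coords = [(r["lon"], r["lat"]) for r in group_df.collect()]
    # build and send the OSRM request for this user/date here

# Option 2: let Spark call a function once per group (requires pandas)
# def route_one_group(pdf):
#     ...call OSRM with pdf[["lon", "lat"]]...
#     return pdf
# result = sdf.groupBy("user_id", "date").applyInPandas(route_one_group, schema=sdf.schema)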
How to do row-wise multiplication of two PySpark DataFrames
I have the two PySpark DataFrames df1 and df2 below: I want to multiply each row of df1 with the same column of the df2 row. The final output should look like this. Answer: You can do a cross join and multiply the columns using a list comprehension:
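A sketch of that cross-join approach, assuming df2 has a single row and both frames share the same numeric column names:

from pyspark.sql import functions as F

# Rename df2's columns so they don't clash after the cross join
df2_renamed = df2.select([F.col(c).alias(f"{c}_2") for c in df2.columns])

value_cols = df1.columns   # assumes every column is numeric and exists in both frames

result = (df1.crossJoin(df2_renamed)
          .select([(F.col(c) * F.col(f"{c}_2")).alias(c) for c in value_cols]))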
Convert date month year time to date format in PySpark
I have a file with a timestamp column. When I try to read the file with a schema I designed myself, it populates the datetime column with null. The source file has data as below, and I am using the code snippet below. In the above, DF.display() shows the result as null for all the inputs. However, my expected output
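The usual cause is that the text in the file doesn’t match Spark’s default timestamp format, so the typed schema silently yields null; a sketch of the common fix, reading the column as a string and parsing it explicitly (the column name and the pattern are assumptions that must match the actual data):

from pyspark.sql import functions as F

# Parse the raw string with an explicit pattern, e.g. 01-Jan-2022 10:30:00
df = df.withColumn("event_ts", F.to_timestamp(F.col("event_time"), "dd-MMM-yyyy HH:mm:ss"))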
Parse JSON string from PySpark DataFrame
I have a nested JSON dict that I need to convert to a Spark DataFrame. This JSON dict is present in a DataFrame column. I have been trying to parse the dict in the DataFrame column using “from_json” and “get_json_object”, but have been unable to read the data. Here’s the smallest snippet of the source data that I’ve been trying to
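from_json only works once the schema of the JSON string is spelled out; a sketch with a hypothetical two-level structure and an assumed column name json_col:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Illustrative schema; it has to mirror the real nesting of the JSON dict
json_schema = StructType([
    StructField("id", StringType()),
    StructField("items", ArrayType(StructType([
        StructField("name", StringType()),
        StructField("value", StringType()),
    ]))),
])

parsed = df.withColumn("parsed", F.from_json(F.col("json_col"), json_schema))

# Flatten the nested fields once parsed
flat = (parsed
        .select("parsed.id", F.explode("parsed.items").alias("item"))
        .select("id", "item.name", "item.value"))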
Get tables from AWS Glue using boto3
I need to harvest table and column names from the AWS Glue crawler metadata catalogue. I used boto3 but keep getting only 100 tables even though there are more. Setting NextToken doesn’t help. Please help if possible. The desired result is a list as follows: lst = [table_one.col_one, table_one.col_two, table_two.col_one….table_n.col_n] UPDATED code, still need to have tablename+columnname: Answer: Adding a sub-loop did
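get_tables caps each response at a page of results, so the cleaner route is boto3’s paginator, which handles NextToken internally; a sketch where the database name is a placeholder:

import boto3

glue = boto3.client("glue")
paginator = glue.get_paginator("get_tables")

lst = []
for page in paginator.paginate(DatabaseName="my_database"):   # placeholder database name
    for table in page["TableList"]:
        # Some table types may not expose a StorageDescriptor, hence the .get defaults
        for col in table.get("StorageDescriptor", {}).get("Columns", []):
            lst.append(f'{table["Name"]}.{col["Name"]}')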
Split a list of overlapping intervals into non-overlapping subintervals in a PySpark DataFrame
I have a PySpark DataFrame that contains the columns start_time and end_time, which define an interval per row. There is also a column rate, and I want to know whether there are different rate values within a sub-interval (which is overlapped by definition); if that is the case, I want to keep the last record as the ground truth. Inputs: Answer
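One common pattern is to explode the intervals into their boundary points, pair consecutive boundaries into non-overlapping sub-intervals, and then keep one rate per sub-interval; a rough sketch, where treating the row with the latest start_time as the “last” record is an assumption:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# 1. Collect every boundary point (all starts and ends)
bounds = (df.select(F.col("start_time").alias("pt"))
            .union(df.select(F.col("end_time").alias("pt")))
            .distinct())

# 2. Pair each boundary with the next one to form non-overlapping sub-intervals
#    (single-partition window; fine for a sketch, partition by a key for real data)
w = Window.orderBy("pt")
subs = (bounds.withColumn("next_pt", F.lead("pt").over(w))
              .where(F.col("next_pt").isNotNull())
              .select(F.col("pt").alias("sub_start"), F.col("next_pt").alias("sub_end")))

# 3. Attach every original row whose interval covers the sub-interval,
#    then keep the last record per sub-interval as the ground truth
joined = subs.join(df, (df.start_time <= subs.sub_start) & (df.end_time >= subs.sub_end))

w_last = Window.partitionBy("sub_start", "sub_end").orderBy(F.col("start_time").desc())
result = (joined.withColumn("rn", F.row_number().over(w_last))
                .where(F.col("rn") == 1)
                .select("sub_start", "sub_end", "rate"))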