
Tag: pyspark

Taking the recency time of client purchase with PySpark

The sample of the dataset I am working on: I’d like to take the customer’s most recent purchase (the customer’s date.max()) and the previous maximum purchase (the penultimate one) and compute the difference between the two (I’m assuming the same product on all purchases). I still haven’t found anything in PySpark that does this. One example of my idea was …
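A minimal sketch of one way to get the gap between the last two purchases per customer with a window function; the column names (customer_id, purchase_date) and sample rows are assumptions, not taken from the original question.

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# assumed sample data standing in for the question's dataset
df = spark.createDataFrame(
    [("c1", "2021-01-05"), ("c1", "2021-03-20"),
     ("c2", "2021-02-01"), ("c2", "2021-02-15")],
    ["customer_id", "purchase_date"],
).withColumn("purchase_date", F.to_date("purchase_date"))

# rank purchases per customer, most recent first
w = Window.partitionBy("customer_id").orderBy(F.col("purchase_date").desc())

recency = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") <= 2)                 # keep the last two purchases
      .groupBy("customer_id")
      .agg(F.datediff(F.max("purchase_date"),
                      F.min("purchase_date")).alias("days_between"))
)
recency.show()
```

Customers with a single purchase end up with days_between = 0 in this sketch; they could be filtered out beforehand if that is not wanted.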

PySpark agg max function showing different result

I was just studying some PySpark code and didn’t understand these particular lines. I have Python code such as below: when showing empDF afterwards, isn’t it supposed to show the longest list? It is showing [Python, R] as the output. I don’t understand how this output comes about. Answer: PySpark’s max function returns the maximum value of …
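A hedged illustration of what is going on: F.max on an array column compares the arrays element by element (lexicographically), not by length, so ["Python", "R"] beats a longer list whose first element sorts earlier. The dataframe contents here are assumptions standing in for the question's empDF.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

empDF = spark.createDataFrame(
    [("alice", ["Java", "Python", "Scala"]),
     ("bob",   ["Python", "R"])],
    ["name", "languages"],
)

# lexicographic comparison: "Python" > "Java", so [Python, R] wins
empDF.agg(F.max("languages")).show(truncate=False)

# to get the longest list instead, order by the array size explicitly
empDF.orderBy(F.size("languages").desc()).limit(1).show(truncate=False)
```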

Parse multi-line CSV using PySpark, Python or Shell

Input (2 columns): Note: Harry and Prof. do not have starting quotes. Output (2 columns). What I tried (PySpark)? Issue: the above code worked fine where the multi-line value had both start and end double quotes (e.g. the row starting with Ronald), but it didn’t work with rows where we only have end quotes and no start quotes (like Harry …
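A minimal sketch of the standard multi-line CSV read in PySpark. This handles fields that are properly wrapped in double quotes; rows with unbalanced quotes (an end quote but no start quote, as in the question) would still need the raw text repaired before parsing. The file name is an assumption.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .option("header", True)
    .option("multiLine", True)   # allow quoted fields to span several lines
    .option("quote", '"')
    .option("escape", '"')
    .csv("people.csv")           # assumed input path
)
df.show(truncate=False)
```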

PySpark – Selecting all rows within each group

I have a dataframe similar to the one below. From the above dataframe, I would like to keep all rows up to the most recent sale relative to the date. So essentially, I will only have a unique date for each row. In the case of the above example, the output would look like: Can you please guide me on how I can get to this result?
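A hedged sketch of keeping one row per date (the most recent sale for that date) with a window function; the column names (sale_date, sale_ts, amount) and sample rows are assumptions, not taken from the original question.

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("2021-01-01", "2021-01-01 09:00:00", 10.0),
     ("2021-01-01", "2021-01-01 17:30:00", 25.0),
     ("2021-01-02", "2021-01-02 12:00:00", 40.0)],
    ["sale_date", "sale_ts", "amount"],
)

# within each date, put the most recent sale first
w = Window.partitionBy("sale_date").orderBy(F.col("sale_ts").desc())

latest_per_date = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)    # keep only the most recent row per date
      .drop("rn")
)
latest_per_date.show()
```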

SAS Proc Transpose to PySpark

I am trying to convert a SAS proc transpose statement to PySpark in Databricks. With the following data as a sample: I would expect the result to look like this. I tried using the pandas pivot_table() function with the following code; however, I ran into some performance issues with the size of the data: Is there a way to translate …
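A rough equivalent of SAS proc transpose using PySpark's own groupBy/pivot, which keeps the work distributed instead of collecting to pandas; the column names (id, varname, value) are assumptions standing in for the BY, ID and VAR variables of the original statement.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# assumed long-format sample data
df = spark.createDataFrame(
    [(1, "height", 180.0), (1, "weight", 75.0),
     (2, "height", 165.0), (2, "weight", 60.0)],
    ["id", "varname", "value"],
)

# one row per id, one column per varname value
wide = df.groupBy("id").pivot("varname").agg(F.first("value"))
wide.show()
```

Passing the list of expected column names to pivot (e.g. pivot("varname", ["height", "weight"])) avoids an extra pass over the data to discover the distinct values.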

Why do I get TypeError: cannot pickle '_thread.RLock' object when using PySpark

I’m using Spark to deal with my data, like this: But I got this error from Spark: Traceback (most recent call last): File "/private/var/www/http/hawk-scripts/hawk_etl/scripts/spark_rds_to_parquet.py", line 46, in process() File "/private/var/www/http/hawk-scripts/hawk_etl/scripts/spark_rds_to_parquet.py", line 36, in process result = spark.sparkContext.parallelize(dataframe_mysql, 1).map(func) File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/context.py", line 574, in parallelize File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/context.py", line 611, in _serialize_to_jvm File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 211, in dump_stream File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 133, …
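A hedged sketch of why this fails and one way around it: a Spark DataFrame holds references to JVM objects (including locks), so it cannot be pickled and handed to sparkContext.parallelize. Operating on its rows through the DataFrame's own .rdd avoids the pickling step. The names dataframe_mysql and func come from the traceback; their contents here are assumptions.

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

# assumed stand-in for the DataFrame read from MySQL
dataframe_mysql = spark.createDataFrame([Row(id=1, name="a"), Row(id=2, name="b")])

def func(row):
    # plain Python transformation applied to each Row
    return (row.id, row.name.upper())

# Instead of spark.sparkContext.parallelize(dataframe_mysql, 1).map(func),
# map over the DataFrame's rows directly:
result = dataframe_mysql.rdd.map(func).collect()
print(result)
```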
