
Tag: pyspark

Taking the recency time of client purchase with PySpark

The sample of the dataset I am working on: I’d like to take the customer’s most recent purchase (the customer’s date.max()) and the previous maximum purchase (the penultimate one) and compute the difference between the two (I’m assuming the same product on all purchases). I still haven’t found anything in PySpark that does this. One example of my idea was …
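A minimal sketch of one way to get the gap between the last two purchases per customer with a window function; the column names (customer_id, purchase_date) and sample rows are assumptions, not taken from the original question.

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# assumed sample data standing in for the question's dataset
df = spark.createDataFrame(
    [("c1", "2021-01-05"), ("c1", "2021-03-20"),
     ("c2", "2021-02-01"), ("c2", "2021-02-15")],
    ["customer_id", "purchase_date"],
).withColumn("purchase_date", F.to_date("purchase_date"))

# rank purchases per customer, most recent first
w = Window.partitionBy("customer_id").orderBy(F.col("purchase_date").desc())

recency = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") <= 2)                 # keep the last two purchases
      .groupBy("customer_id")
      .agg(F.datediff(F.max("purchase_date"),
                      F.min("purchase_date")).alias("days_between"))
)
recency.show()
```

Customers with a single purchase end up with days_between = 0 in this sketch; they could be filtered out beforehand if that is not wanted.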

PySpark agg max function showing different result

I was just studying some PySpark code and didn’t understand these particular lines. I have Python code such as below: when showing empDF afterwards, isn’t it supposed to show the longest list? It is showing [Python, R] as the output. I don’t understand how this output comes about. Answer: PySpark’s max function returns the maximum value of …
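A hedged illustration of what is going on: F.max on an array column compares the arrays element by element (lexicographically), not by length, so ["Python", "R"] beats a longer list whose first element sorts earlier. The dataframe contents here are assumptions standing in for the question's empDF.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

empDF = spark.createDataFrame(
    [("alice", ["Java", "Python", "Scala"]),
     ("bob",   ["Python", "R"])],
    ["name", "languages"],
)

# lexicographic comparison: "Python" > "Java", so [Python, R] wins
empDF.agg(F.max("languages")).show(truncate=False)

# to get the longest list instead, order by the array size explicitly
empDF.orderBy(F.size("languages").desc()).limit(1).show(truncate=False)
```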

Parse multi-line CSV using PySpark, Python or Shell

Input (2 columns): Note: Harry and Prof. do not have starting quotes. Output (2 columns). What I tried (PySpark)? Issue: the above code worked fine where the multi-line value had both start and end double quotes (e.g. the row starting with Ronald), but it didn’t work with rows where we only have end quotes and no start quotes (like Harry …
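A minimal sketch of the standard multi-line CSV read in PySpark. This handles fields that are properly wrapped in double quotes; rows with unbalanced quotes (an end quote but no start quote, as in the question) would still need the raw text repaired before parsing. The file name is an assumption.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .option("header", True)
    .option("multiLine", True)   # allow quoted fields to span several lines
    .option("quote", '"')
    .option("escape", '"')
    .csv("people.csv")           # assumed input path
)
df.show(truncate=False)
```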

PySpark – Selecting all rows within each group

I have a dataframe similar to the one below. From the above dataframe, I would like to keep all rows up to the most recent sale relative to the date. So essentially, I will only have a unique date for each row. In the case of the above example, the output would look like: Can you please guide me on how I can get to this result?
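A hedged sketch of keeping one row per date (the most recent sale for that date) with a window function; the column names (sale_date, sale_ts, amount) and sample rows are assumptions, not taken from the original question.

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("2021-01-01", "2021-01-01 09:00:00", 10.0),
     ("2021-01-01", "2021-01-01 17:30:00", 25.0),
     ("2021-01-02", "2021-01-02 12:00:00", 40.0)],
    ["sale_date", "sale_ts", "amount"],
)

# within each date, put the most recent sale first
w = Window.partitionBy("sale_date").orderBy(F.col("sale_ts").desc())

latest_per_date = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)    # keep only the most recent row per date
      .drop("rn")
)
latest_per_date.show()
```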

SAS Proc Transpose to PySpark

I am trying to convert a SAS proc transpose statement to PySpark in Databricks. With the following data as a sample: I would expect the result to look like this. I tried using the pandas pivot_table() function with the following code; however, I ran into some performance issues with the size of the data: Is there a way to translate …
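A rough equivalent of SAS proc transpose using PySpark's own groupBy/pivot, which keeps the work distributed instead of collecting to pandas; the column names (id, varname, value) are assumptions standing in for the BY, ID and VAR variables of the original statement.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# assumed long-format sample data
df = spark.createDataFrame(
    [(1, "height", 180.0), (1, "weight", 75.0),
     (2, "height", 165.0), (2, "weight", 60.0)],
    ["id", "varname", "value"],
)

# one row per id, one column per varname value
wide = df.groupBy("id").pivot("varname").agg(F.first("value"))
wide.show()
```

Passing the list of expected column names to pivot (e.g. pivot("varname", ["height", "weight"])) avoids an extra pass over the data to discover the distinct values.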

Why do I get TypeError: cannot pickle '_thread.RLock' object when using PySpark

I’m using Spark to deal with my data, like this: But I got this error from Spark: Traceback (most recent call last): File "/private/var/www/http/hawk-scripts/hawk_etl/scripts/spark_rds_to_parquet.py", line 46, in process() File "/private/var/www/http/hawk-scripts/hawk_etl/scripts/spark_rds_to_parquet.py", line 36, in process result = spark.sparkContext.parallelize(dataframe_mysql, 1).map(func) File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/context.py", line 574, in parallelize File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/context.py", line 611, in _serialize_to_jvm File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 211, in dump_stream File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 133, …
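A hedged sketch of why this fails and one way around it: a Spark DataFrame holds references to JVM objects (including locks), so it cannot be pickled and handed to sparkContext.parallelize. Operating on its rows through the DataFrame's own .rdd avoids the pickling step. The names dataframe_mysql and func come from the traceback; their contents here are assumptions.

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

# assumed stand-in for the DataFrame read from MySQL
dataframe_mysql = spark.createDataFrame([Row(id=1, name="a"), Row(id=2, name="b")])

def func(row):
    # plain Python transformation applied to each Row
    return (row.id, row.name.upper())

# Instead of spark.sparkContext.parallelize(dataframe_mysql, 1).map(func),
# map over the DataFrame's rows directly:
result = dataframe_mysql.rdd.map(func).collect()
print(result)
```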
