Skip to content
Advertisement

Tag: apache-spark

PySpark – Selecting all rows within each group

I have a dataframe similar to below. From the above dataframe, I would like to keep all rows upto the most recent sale relative to the date. So essentially, I will only have unique date for each row. In the case of above example, output would look like: Can you please guide on how can I go to this result?

Why do I got TypeError: cannot pickle ‘_thread.RLock’ object when using pyspark

I’m using spark to deal with my data, like that: But I got this error from spark: Traceback (most recent call last): File “/private/var/www/http/hawk-scripts/hawk_etl/scripts/spark_rds_to_parquet.py”, line 46, in process() File “/private/var/www/http/hawk-scripts/hawk_etl/scripts/spark_rds_to_parquet.py”, line 36, in process result = spark.sparkContext.parallelize(dataframe_mysql, 1).map(func) File “/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/context.py”, line 574, in parallelize File “/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/context.py”, line 611, in _serialize_to_jvm File “/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py”, line 211, in dump_stream File “/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py”, line 133,

Most efficient way of applying a function based on condition

Suppose we have a master dictionary master_dict = {“a”: df1, “b”: df2, “c”: df3}. Now suppose we have a list called condition_list. Suppose func is a function that returns a new dictionary that has the original keys of master_dict along with potentially new keys. What is the best way to get the below code to work when the length of

Pyspark: regex search with text in a list withColumn

I am new to Spark and I am having a silly “what’s-the-best-approach” issue. Basically, I have a map(dict) that I would like to loop over. During each iteration, I want to search through a column in a spark dataframe using rlike regex and assign the key of the dict to a new column using withColumn The data sample is shown

How to select rows from list in PySpark

Suppose we have two dataframes df1 and df2 where df1 has columns [a, b, c, p, q, r] and df2 has columns [d, e, f, a, b, c]. Suppose the common columns are stored in a list common_cols = [‘a’, ‘b’, ‘c’]. How do you join the two dataframes using the common_cols list within a sql command? The code below

Advertisement