I’m trying to match text records in two datasets, mostly using PySpark (avoiding libraries and techniques such as BM25 or heavier NLP as much as I can for now; using the Spark ML and Spark NLP libraries is fine). I’m close to finishing the pre-processing phase. I’ve cleaned the text in both datasets, tokenized it, and created bigrams (stored in a column called
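For context, a minimal sketch of the tokenize-then-bigram step using Spark ML’s Tokenizer and NGram; the column names and sample rows are assumptions, not the poster’s actual schema:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, NGram

spark = SparkSession.builder.getOrCreate()

# Hypothetical cleaned-text column; the real column names are assumptions
df = spark.createDataFrame([("acme corp new york",), ("acme corporation ny",)], ["clean_text"])

tokenizer = Tokenizer(inputCol="clean_text", outputCol="tokens")
bigrammer = NGram(n=2, inputCol="tokens", outputCol="bigrams")

with_bigrams = bigrammer.transform(tokenizer.transform(df))
with_bigrams.select("bigrams").show(truncate=False)
```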
pyspark: turn array of dicts to new columns
I am struggling to transform my pyspark dataframe, which looks like this: to this: I tried to pivot, and a bunch of other things, but I don’t get the result above. Note that I don’t know the exact number of dicts in the column Tstring. Do you know how I can do this? Answer Using the transform function you can convert each
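A possible sketch of the transform-based approach the answer hints at, assuming Tstring is an array of single-entry {key, value} dicts; the sample schema and the list of keys are made up for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: each row carries an array of {key, value} dicts in "Tstring"
df = spark.createDataFrame(
    [(1, [{"key": "a", "value": "1"}, {"key": "b", "value": "2"}]),
     (2, [{"key": "a", "value": "3"}])],
    "id INT, Tstring ARRAY<MAP<STRING,STRING>>",
)

# Collapse the array of single-entry dicts into one map per row,
# then pull each key out as its own column.
keyed = df.withColumn(
    "kv",
    F.map_from_entries(
        F.transform("Tstring", lambda m: F.struct(m["key"].alias("k"), m["value"].alias("v")))
    ),
)

keys = ["a", "b"]  # in practice, collect the distinct keys first
result = keyed.select("id", *[F.col("kv")[k].alias(k) for k in keys])
result.show()
```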
PySpark – Selecting all rows within each group
I have a dataframe similar to the one below. From the above dataframe, I would like to keep all rows up to the most recent sale relative to the date. So essentially, I will only have a unique date for each row. In the case of the above example, the output would look like: Can you please guide me on how I can get to this result?
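Since the sample table isn’t shown, here is a generic sketch of the “keep only the most recent row per group” pattern with a window function; the item, date and sale columns are assumptions:

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema: one row per (item, date) with a sale amount
df = spark.createDataFrame(
    [("a", "2023-01-01", 10), ("a", "2023-01-05", 20), ("b", "2023-01-03", 7)],
    ["item", "date", "sale"],
)

# Keep only the most recent row per item by ranking dates descending
w = Window.partitionBy("item").orderBy(F.col("date").desc())
latest = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)
      .drop("rn")
)
latest.show()
```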
Why do I get TypeError: cannot pickle ‘_thread.RLock’ object when using pyspark
I’m using spark to deal with my data, like this: But I got this error from spark:

Traceback (most recent call last):
  File "/private/var/www/http/hawk-scripts/hawk_etl/scripts/spark_rds_to_parquet.py", line 46, in <module>
    process()
  File "/private/var/www/http/hawk-scripts/hawk_etl/scripts/spark_rds_to_parquet.py", line 36, in process
    result = spark.sparkContext.parallelize(dataframe_mysql, 1).map(func)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/context.py", line 574, in parallelize
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/context.py", line 611, in _serialize_to_jvm
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 211, in dump_stream
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 133,
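A hedged sketch of the usual cause and fix: passing a DataFrame to sparkContext.parallelize forces Spark to pickle the DataFrame, which drags the JVM-backed SparkContext (and its thread lock) into the serializer. The stand-in data and func body below are assumptions, not the poster’s code:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the JDBC read of the MySQL table (spark.read.format("jdbc")...)
dataframe_mysql = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "payload"])

def func(row):
    # Plain-Python work on one Row; must not touch spark / SparkContext
    return (row["id"], len(row["payload"] or ""))

# Broken pattern: spark.sparkContext.parallelize(dataframe_mysql, 1).map(func)
# tries to pickle the DataFrame, which pulls the SparkContext (and its
# _thread.RLock) into the pickle and fails. Mapping over the DataFrame's
# RDD of Rows avoids pickling the DataFrame itself:
result = dataframe_mysql.rdd.map(func).collect()
print(result)
```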
PySpark Data Visualization from String Values in Columns
I have a PySpark dataframe with the information shown in the table. I need to perform a data visualization by plotting the number of completed studies in each month of a given year. I am of the opinion that each completed status (taken from the status column) will be matched against each of the months of the
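A minimal sketch of one way to do this, assuming a status column and a completion-date column (both names are guesses): aggregate completed studies per month in Spark, then hand the small result to pandas/matplotlib for plotting:

```python
from pyspark.sql import SparkSession, functions as F
import matplotlib.pyplot as plt

spark = SparkSession.builder.getOrCreate()

# Hypothetical columns: "status" (string) and "completion_date" (string date)
df = spark.createDataFrame(
    [("Completed", "2022-01-15"), ("Completed", "2022-03-02"), ("Ongoing", "2022-03-10")],
    ["status", "completion_date"],
)

monthly = (
    df.filter(F.col("status") == "Completed")
      .withColumn("month", F.month(F.to_date("completion_date")))
      .groupBy("month")
      .count()
      .orderBy("month")
)

# Spark does not plot; hand the small aggregate to pandas/matplotlib
pdf = monthly.toPandas()
pdf.plot.bar(x="month", y="count", legend=False)
plt.ylabel("completed studies")
plt.show()
```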
Most efficient way of applying a function based on condition
Suppose we have a master dictionary master_dict = {"a": df1, "b": df2, "c": df3}. Now suppose we have a list called condition_list. Suppose func is a function that returns a new dictionary that has the original keys of master_dict along with potentially new keys. What is the best way to get the below code to work when the length of
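Since the referenced code is cut off, here is only a rough sketch of the looping pattern described; the stand-in dataframes, func and condition_list are all assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for df1/df2/df3, func and condition_list
df1 = spark.range(3)
df2 = spark.range(5)
df3 = spark.range(7)
master_dict = {"a": df1, "b": df2, "c": df3}
condition_list = ["a", "c"]

def func(d, cond):
    # Example func: keep the original keys and add one derived key per condition
    out = dict(d)
    out[f"{cond}_filtered"] = d[cond].filter("id % 2 = 0")
    return out

# Apply func once per condition, folding the result back into master_dict
for cond in condition_list:
    master_dict = func(master_dict, cond)

print(sorted(master_dict))  # ['a', 'a_filtered', 'b', 'c', 'c_filtered']
```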
Pandas UDF throws error not of required length
I have a delta table which has thrift data from kafka, and I am using a UDF to deserialize it. I have no issues when I use a regular UDF, but I get an error when I try to use a Pandas UDF. This runs fine, i.e. the regular UDF. But when I use the Pandas UDF I get an error: PythonException: ‘RuntimeError: Result
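For illustration, a Series-to-Series pandas UDF that wraps a row-wise deserializer; the binary column and the stand-in decoder are assumptions. The key constraint is that the UDF must return exactly one output value per input value, which is typically what the “not the required length” error points to:

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical binary column standing in for the thrift payload
df = spark.createDataFrame([(bytearray(b"\x01\x02"),), (bytearray(b"\x03"),)], ["payload"])

def deserialize(blob: bytes) -> str:
    # stand-in for the real thrift deserializer
    return blob.hex()

# A Series-to-Series pandas UDF must return one value per input element;
# returning a scalar or a differently sized list raises the length error.
@F.pandas_udf(StringType())
def deserialize_udf(payloads: pd.Series) -> pd.Series:
    return payloads.apply(deserialize)

df.withColumn("decoded", deserialize_udf("payload")).show()
```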
Pyspark: regex search with text in a list withColumn
I am new to Spark and I am having a silly “what’s-the-best-approach” issue. Basically, I have a map (dict) that I would like to loop over. During each iteration, I want to search through a column in a spark dataframe using an rlike regex and assign the key of the dict to a new column using withColumn. The data sample is shown
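A sketch of one way to fold the dict into a single chained when/rlike expression instead of calling withColumn once per iteration; the message column and pattern map are invented for the example:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data and pattern map
df = spark.createDataFrame([("error: disk full",), ("user logged in",)], ["message"])
patterns = {"storage": r"disk|volume", "auth": r"login|logged"}

# Chain one when(rlike) per dict entry, then add the column once
label = F.lit(None)
for key, regex in patterns.items():
    label = F.when(F.col("message").rlike(regex), key).otherwise(label)

df.withColumn("category", label).show()
```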
How to select rows from list in PySpark
Suppose we have two dataframes df1 and df2, where df1 has columns [a, b, c, p, q, r] and df2 has columns [d, e, f, a, b, c]. Suppose the common columns are stored in a list common_cols = ['a', 'b', 'c']. How do you join the two dataframes using the common_cols list within a sql command? The code below
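One possible sketch: build the ON clause from common_cols and run it through spark.sql; the sample rows are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical dataframes with the column layout from the question
df1 = spark.createDataFrame([(1, 2, 3, 4, 5, 6)], ["a", "b", "c", "p", "q", "r"])
df2 = spark.createDataFrame([(7, 8, 9, 1, 2, 3)], ["d", "e", "f", "a", "b", "c"])
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")

common_cols = ["a", "b", "c"]

# Build the ON clause from the list so the SQL stays in sync with common_cols
on_clause = " AND ".join(f"df1.{c} = df2.{c}" for c in common_cols)
joined = spark.sql(f"SELECT df1.*, df2.d, df2.e, df2.f FROM df1 JOIN df2 ON {on_clause}")
joined.show()
```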
How to convert JSON data inside a spark dataframe into new columns
I have a spark dataframe like this: I want to convert the JSON (string) column to new columns. I don’t want to manually specify the keys from the JSON, as there are more than 100 keys. Answer
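A sketch of a common approach, assuming the JSON string lives in a column named payload (a made-up name): infer the schema from the data itself so none of the 100+ keys has to be typed out, then parse with from_json and flatten:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical dataframe with a JSON string column "payload"
df = spark.createDataFrame(
    [('{"id": 1, "name": "a", "score": 0.5}',), ('{"id": 2, "name": "b", "score": 0.9}',)],
    ["payload"],
)

# Infer the schema from the JSON strings themselves
inferred_schema = spark.read.json(df.rdd.map(lambda r: r["payload"])).schema

# Parse the string column and promote every key to its own column
flattened = df.withColumn("parsed", F.from_json("payload", inferred_schema)).select("parsed.*")
flattened.show()
```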