I am struggling to transform my pyspark dataframe, which looks like this: to this: I tried pivot and a bunch of other things but can’t get the result above. Note that I don’t have a fixed number of dicts in the column Tstring. Do you know how I can do this? Answer Using the transform function you can convert each
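The excerpt cuts off before the actual answer, but one way to handle a varying number of dicts is to parse Tstring with from_json and rework each element with transform. A minimal sketch, assuming Tstring holds a JSON array of string-to-string dicts (the element structure and the output column names here are made up):

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, MapType, StringType

    # Assumed shape of Tstring: '[{"name": "a", "value": "1"}, {"name": "b", "value": "2"}, ...]'
    parsed = df.withColumn(
        "Tarray",
        F.from_json("Tstring", ArrayType(MapType(StringType(), StringType())))
    )

    # transform rewrites every element of the array, however many dicts there are
    result = parsed.withColumn("Tvalues", F.transform("Tarray", lambda d: d["value"]))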
Why do I get TypeError: cannot pickle '_thread.RLock' object when using pyspark
I’m using spark to deal with my data, like this: But I got this error from spark:

Traceback (most recent call last):
  File "/private/var/www/http/hawk-scripts/hawk_etl/scripts/spark_rds_to_parquet.py", line 46, in <module>
    process()
  File "/private/var/www/http/hawk-scripts/hawk_etl/scripts/spark_rds_to_parquet.py", line 36, in process
    result = spark.sparkContext.parallelize(dataframe_mysql, 1).map(func)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/context.py", line 574, in parallelize
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/context.py", line 611, in _serialize_to_jvm
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 211, in dump_stream
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 133,
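The usual cause is handing the DataFrame itself to parallelize/map: a DataFrame carries a reference to the SparkContext (which holds an RLock), so it cannot be pickled and shipped to the workers. A minimal sketch of two common workarounds, reusing the names from the traceback:

    # Ship the rows, not the DataFrame object itself
    rows = dataframe_mysql.collect()                       # plain Row objects pickle fine
    result = spark.sparkContext.parallelize(rows, 1).map(func)

    # or, without pulling everything to the driver first:
    result = dataframe_mysql.rdd.map(func)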
PySpark Data Visualization from String Values in Columns
I have a table, shown below, which comes from a PySpark dataframe. I need to perform a data visualization by plotting the number of completed studies in each month of a given year. I am of the opinion that each completed entry (taken from the status column) will be matched against each of the months of the
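Since the question is cut off, the following is only a sketch of the usual pattern: filter to completed rows, aggregate per month, then hand the small result to pandas for plotting (the column names 'status' and 'completion_date' and the year value are assumptions):

    from pyspark.sql import functions as F

    monthly = (df
               .filter(F.col("status") == "Completed")       # assumed status value
               .filter(F.year("completion_date") == 2021)    # assumed year of interest
               .withColumn("month", F.month("completion_date"))
               .groupBy("month")
               .count()
               .orderBy("month"))

    # Spark itself doesn't plot; the aggregate is small, so collect it and plot with pandas
    monthly.toPandas().plot.bar(x="month", y="count")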
How to convert JSON data inside a spark dataframe into new columns
I have a spark dataframe like this. I want to convert the JSON (string) into new columns. I don’t want to manually specify the keys from the JSON, as there are more than 100 keys. Answer
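One way to avoid listing 100+ keys by hand is to let Spark infer the full schema from the JSON strings themselves and then expand the parsed struct. A sketch, assuming the JSON strings live in a column called json_col:

    from pyspark.sql import functions as F

    # Infer the schema directly from the JSON strings
    json_schema = spark.read.json(df.rdd.map(lambda row: row.json_col)).schema

    df_flat = (df
               .withColumn("parsed", F.from_json("json_col", json_schema))
               .select("*", "parsed.*")
               .drop("parsed", "json_col"))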
Column with column names for nulls in row
I want to add a new column “Null_Values” to a PySpark dataframe, as shown below. Answer or
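A sketch of one way to build such a column: emit each column’s name when its value is null and glue the names together (concat_ws skips nulls), assuming the new column should be a comma-separated string:

    from pyspark.sql import functions as F

    df_out = df.withColumn(
        "Null_Values",
        F.concat_ws(",", *[F.when(F.col(c).isNull(), F.lit(c)) for c in df.columns])
    )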
Logical with count in Pyspark
I’m new to Pyspark and I have a problem to solve. I have a dataframe with 4 columns: customer, PersonId, is_online_store and count:

customer   PersonId   is_online_store   count
afabd2d2   4          true              1
afabd2d2   8          true              2
afabd2d2   3          true              1
afabd2d2   2          false             1
afabd2d2   4          false             1

I need to create according to the following rules: If PersonId count(column)
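The rules themselves are cut off above, so the following only illustrates the general pattern of conditional aggregation per customer and PersonId; the specific conditions and output columns are hypothetical:

    from pyspark.sql import functions as F

    agg = (df.groupBy("customer", "PersonId")
             .agg(
                 F.sum(F.when(F.col("is_online_store"), F.col("count")).otherwise(0)).alias("online_count"),
                 F.sum(F.when(~F.col("is_online_store"), F.col("count")).otherwise(0)).alias("offline_count"),
             ))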
Calculate difference between date column entries and date minimum Pyspark
I feel like this is a stupid question, but I cannot seem to figure it out, so here goes. I have a PySpark data frame and one of the columns consists of dates. I want to compute the difference between each date in this column and the minimum date in the column, for the purpose of filtering to the past
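A small sketch of the usual approach: take the overall minimum once, then use datediff against it (the column name date_col and the 30-day threshold are assumptions):

    from pyspark.sql import functions as F

    min_date = df.agg(F.min("date_col")).first()[0]

    df_diff = df.withColumn(
        "days_from_min",
        F.datediff(F.col("date_col"), F.lit(str(min_date)))
    )

    # e.g. keep only rows within 30 days of the earliest date
    recent = df_diff.filter(F.col("days_from_min") <= 30)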
Import pipe delimited txt file into spark dataframe in databricks
I have a data file saved in .txt format which has a header row at the top and is pipe delimited. I am working in Databricks and need to create a Spark dataframe of this data, with all columns read in as StringType(), the headers defined by the first row, and the columns separated based on the pipe delimiter.
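A sketch of the corresponding CSV-reader call (the file path is a placeholder):

    df = (spark.read
          .option("header", "true")        # first row supplies the column names
          .option("sep", "|")              # pipe delimiter
          .option("inferSchema", "false")  # leave every column as StringType
          .csv("dbfs:/FileStore/my_data.txt"))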
How to get conditional values into new column from several external lists or arrays
I have the following dataframe: For it I have to create an additional column, new_col_cond, that depends on the values of multiple external lists/arrays (I have also tried with dictionaries), for example: The new column depends on the value of ratio and selects from one array or the other using id as the index. I have tried: but errors come up. I assume
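One way to do this is to turn each external Python list into an array column of literals and index it with the id column; the lists, the ratio threshold and the condition below are hypothetical:

    from pyspark.sql import functions as F

    list_a = [10, 20, 30, 40]   # hypothetical external lists
    list_b = [1, 2, 3, 4]

    arr_a = F.array(*[F.lit(x) for x in list_a])
    arr_b = F.array(*[F.lit(x) for x in list_b])

    df_out = df.withColumn(
        "new_col_cond",
        F.when(F.col("ratio") > 0.5, arr_a[F.col("id")]).otherwise(arr_b[F.col("id")])
    )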
Summarizing labels at time steps based on current and past info
Given the following input dataframe, a dataframe which looks like this needs to be constructed. The input dataframe has tens of millions of records. Some details seen in the example above (by design): npos is the size of the vector to be constructed in the output; pos is guaranteed to be in [0, npos) at each time step (elap).
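With the example itself missing, this is only a sketch of one possible approach: carry the running history of (pos, label) pairs forward with a window, then scatter it into a fixed-length vector (the column names id, elap, pos, label, npos and the string label type are assumptions; a UDF is used for clarity, not speed):

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType
    from pyspark.sql.window import Window

    # running history of (pos, label) pairs up to and including each time step
    w = (Window.partitionBy("id").orderBy("elap")
         .rowsBetween(Window.unboundedPreceding, Window.currentRow))
    with_seen = df.withColumn("seen", F.collect_list(F.struct("pos", "label")).over(w))

    @F.udf(ArrayType(StringType()))
    def to_vector(seen, npos):
        # place every observed label at its position; later observations overwrite earlier ones
        vec = [None] * npos
        for s in seen:
            vec[s.pos] = s.label
        return vec

    result = with_seen.withColumn("vec", to_vector("seen", "npos"))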