I am having a problem running a prediction using a saved MultilayerPerceptronClassifier model. It throws the error: The original mlpc in the pipeline had layers defined: My attempts to solve it: if I run the pipeline model and make predictions without first saving the model, it works with no error. But saving and re-using the model throws this error. Any help is appreciated.
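For reference, a minimal sketch of the save/reload flow the question describes, with illustrative layer sizes, column names, and paths (train_df and test_df are assumed to exist); it shows the standard PipelineModel.load() path rather than a fix for the error:

```python
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.feature import VectorAssembler

# Assemble features and define an MLPC whose first layer matches the feature count.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
mlpc = MultilayerPerceptronClassifier(layers=[3, 5, 2], featuresCol="features",
                                      labelCol="label", seed=42)

model = Pipeline(stages=[assembler, mlpc]).fit(train_df)

# Persist the fitted pipeline, then reload it and predict.
model.write().overwrite().save("/tmp/mlpc_pipeline_model")
reloaded = PipelineModel.load("/tmp/mlpc_pipeline_model")
predictions = reloaded.transform(test_df)
```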
Tag: pyspark
Spark: How to flatten nested arrays with different shapes
How do I flatten nested arrays with different shapes in PySpark? The question How to flatten nested arrays by merging values in spark answers this for arrays with the same shape. I'm getting the errors described below for arrays with different shapes. Data structure: Static names: id, date, val, num (can be hardcoded). Dynamic names: name_1_a, name_10000_xvz (cannot be hardcoded as the data frame has
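As a starting point, a minimal sketch of flattening a nested array column whose inner arrays may have different lengths; the column name "values" and the two-level nesting are assumptions, not the asker's exact schema:

```python
from pyspark.sql import functions as F

# df is assumed to hold an array<array<...>> column named "values".
# explode_outer keeps the row even when the array is null or empty.
flat = (
    df.withColumn("inner", F.explode_outer("values"))
      .withColumn("value", F.explode_outer("inner"))
      .drop("values", "inner")
)
```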
Pyspark: How to flatten nested arrays by merging values in spark
I have 10000 JSONs with different ids, each of which has 10000 names. How do I flatten nested arrays by merging values by int or str in PySpark? EDIT: I have added the column name_10000_xvz to better explain the data structure. I have updated the Notes, input df, required output df, and input JSON files as well. Notes: The input dataframe has more than 10000 columns name_1_a,
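Since the name_* columns cannot be hardcoded, one hedged sketch is to melt all dynamic columns into rows by building a map and exploding it; the static column list follows the description above, everything else is illustrative:

```python
from itertools import chain
from pyspark.sql import functions as F

static_cols = ["id", "date", "val", "num"]
dynamic_cols = [c for c in df.columns if c not in static_cols]  # df assumed to exist

# Build a map {column name -> value} over the dynamic columns and explode it
# into (name, value) rows, keeping the static columns alongside.
kv = F.create_map(*chain.from_iterable((F.lit(c), F.col(c)) for c in dynamic_cols))
long_df = df.select(*static_cols, F.explode(kv).alias("name", "value"))
```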
Calculate the minimum distance to destinations for each origin in pyspark
I have a list of origins and destinations along with their geo coordinates. I need to calculate the minimum distance from each origin to the destinations. Below is my code: I got an error like the one below. My question is: it seems that there is something wrong with withColumn('Distance', haversine_vector(F.col('Origin_Geo'), F.col('Destination_Geo'))). I do not know why. (I'm new to pyspark..) I have
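One hedged alternative that avoids a UDF entirely is to cross-join origins with destinations, compute the haversine distance with built-in functions, and take the minimum per origin; the column names (origin_id, orig_lat/orig_lon, dest_lat/dest_lon) are assumptions about the schema:

```python
from pyspark.sql import functions as F

R = 6371.0  # Earth radius in km

pairs = origins.crossJoin(destinations)  # origins and destinations assumed to exist

dist = (
    pairs
    .withColumn("dlat", F.radians(F.col("dest_lat") - F.col("orig_lat")))
    .withColumn("dlon", F.radians(F.col("dest_lon") - F.col("orig_lon")))
    .withColumn(
        "a",
        F.pow(F.sin(F.col("dlat") / 2), 2)
        + F.cos(F.radians("orig_lat")) * F.cos(F.radians("dest_lat"))
        * F.pow(F.sin(F.col("dlon") / 2), 2),
    )
    .withColumn("distance_km", 2 * R * F.asin(F.sqrt("a")))
)

# Minimum distance to any destination, per origin.
min_dist = dist.groupBy("origin_id").agg(F.min("distance_km").alias("min_distance_km"))
```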
Converting pandas dataframe to PySpark dataframe drops index
I've got a pandas dataframe called data_clean. It looks like this: I want to convert it to a Spark dataframe, so I use the createDataFrame() method: sparkDF = spark.createDataFrame(data_clean) However, that seems to drop the index column (the one that has the names ali, anthony, bill, etc.) from the original dataframe. The output of the converted dataframe is: The docs say createDataFrame() can
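A minimal sketch, assuming the names live in an unnamed pandas index: move the index into a regular column with reset_index() before handing the frame to Spark (the new column name is illustrative):

```python
# reset_index() turns the index into an ordinary column; createDataFrame() then keeps it.
pdf = data_clean.reset_index().rename(columns={"index": "name"})
sparkDF = spark.createDataFrame(pdf)
```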
Sum value between overlapping interval slices per group
I have a PySpark dataframe as below: And I want to sum only the consumption on overlapping interval slices per idx: Answer You can use sequence to expand the intervals into single days, explode the list of days, and then sum the consumption for each timestamp and idx: Output: Remarks: sequence includes the last value of the interval, so one day
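A hedged sketch of the approach the answer describes; the column names (idx, start_date, end_date, consumption) are assumptions about the schema:

```python
from pyspark.sql import functions as F

# Expand each interval into one row per day, then aggregate per idx and day.
exploded = df.withColumn(
    "day",
    F.explode(F.sequence("start_date", "end_date", F.expr("interval 1 day"))),
)

result = exploded.groupBy("idx", "day").agg(F.sum("consumption").alias("consumption"))
```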
New column comparing dates in PySpark
I am struggling to create a new column based on a simple condition comparing two dates. I have tried the following, which yields a syntax error. I have also updated it as follows: But this yields a Python error that the Column is not callable. How would I create a new column that dynamically adjusts based on whether the date comparator
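The usual pattern for this is when/otherwise; a minimal sketch with illustrative column names and labels:

```python
from pyspark.sql import functions as F

# New column whose value depends on how the two dates compare.
df = df.withColumn(
    "flag",
    F.when(F.col("start_date") <= F.col("end_date"), F.lit("on_time"))
     .otherwise(F.lit("late")),
)
```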
How can I turn off rounding in Spark?
I have a dataframe and I'm doing this: I want to get just the first four digits after the decimal point, without rounding. When I cast to DecimalType with .cast(DataTypes.createDecimalType(20,4)), or even with the round function, the number is rounded to 0.4220. The only way that I found without rounding is applying the function format_number(), but this function gives me a string,
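One hedged sketch that truncates rather than rounds is to scale, floor, and scale back; the column name is illustrative, and note that floor-based truncation behaves differently for negative values:

```python
from pyspark.sql import functions as F

# Keep the first four decimal places without rounding (for non-negative values).
df = df.withColumn("truncated", F.floor(F.col("value") * 10000) / 10000)
```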
Pyspark get top two values in column from a group based on ordering
I am trying to get the first two counts that appear in this list, by the earliest log_date on which they appeared. In this case, my expected output is: This is what I have working, but there are a few edge cases where the count could go down and then back up, as shown in the example above. This code returns 2021-07-11 as the
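A hedged sketch of one way to get the first two distinct counts per group by the date they first appeared; the grouping and column names (group_id, count, log_date) are assumptions:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Earliest date each distinct count was seen, per group.
first_seen = (
    df.groupBy("group_id", "count")
      .agg(F.min("log_date").alias("first_log_date"))
)

# Keep the two counts that appeared earliest in each group.
w = Window.partitionBy("group_id").orderBy("first_log_date")
top_two = (
    first_seen
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") <= 2)
    .drop("rn")
)
```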
Pivoting DataFrame with fixed column names
Let's say I have the dataframe below, and by design each user has 3 rows. I want to turn my DataFrame into: I was trying groupBy(col('user')) and then pivoting by ticker, but that returns as many columns as there are distinct tickers, whereas I want a fixed number of columns. Is there any other Spark operator I
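Since each user has exactly 3 rows, one hedged sketch is to number the rows within each user and pivot on that index so the output always has the same three columns; the ordering column and names are assumptions:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Assign a stable position 1..3 within each user, then pivot on the position.
w = Window.partitionBy("user").orderBy("ticker")
result = (
    df.withColumn("pos", F.row_number().over(w))
      .groupBy("user")
      .pivot("pos", [1, 2, 3])
      .agg(F.first("ticker"))
)
```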