I am having a problem running a prediction using a saved MultilayerPerceptronClassifier model. It throws the error: The original mlpc in the pipeline had layers defined: My attempts to solve it: if I run the pipeline model and make predictions without first saving the model, it works with no error. But saving and re-using the model throws this error. Any help is appreciated.
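For reference, a minimal sketch of the save/reload flow the question describes, with illustrative layer sizes, column names, and paths (train_df and test_df are assumed to exist); it shows the standard PipelineModel.load() path rather than a fix for the error:

```python
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.feature import VectorAssembler

# Assemble features and define an MLPC whose first layer matches the feature count.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
mlpc = MultilayerPerceptronClassifier(layers=[3, 5, 2], featuresCol="features",
                                      labelCol="label", seed=42)

model = Pipeline(stages=[assembler, mlpc]).fit(train_df)

# Persist the fitted pipeline, then reload it and predict.
model.write().overwrite().save("/tmp/mlpc_pipeline_model")
reloaded = PipelineModel.load("/tmp/mlpc_pipeline_model")
predictions = reloaded.transform(test_df)
```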
Tag: pyspark
Spark: How to flatten nested arrays with different shapes
How do I flatten nested arrays with different shapes in PySpark? The question How to flatten nested arrays by merging values in spark answers this for arrays with the same shape. I'm getting the errors described below for arrays with different shapes. Data structure: Static names: id, date, val, num (can be hardcoded). Dynamic names: name_1_a, name_10000_xvz (cannot be hardcoded as the data frame has
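As a starting point, a minimal sketch of flattening a nested array column whose inner arrays may have different lengths; the column name "values" and the two-level nesting are assumptions, not the asker's exact schema:

```python
from pyspark.sql import functions as F

# df is assumed to hold an array<array<...>> column named "values".
# explode_outer keeps the row even when the array is null or empty.
flat = (
    df.withColumn("inner", F.explode_outer("values"))
      .withColumn("value", F.explode_outer("inner"))
      .drop("values", "inner")
)
```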
Pyspark: How to flatten nested arrays by merging values in spark
I have 10000 JSONs with different ids, each of which has 10000 names. How do I flatten nested arrays by merging values by int or str in PySpark? EDIT: I have added the column name_10000_xvz to better explain the data structure. I have updated the Notes, input df, required output df, and input JSON files as well. Notes: The input dataframe has more than 10000 columns name_1_a,
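Since the name_* columns cannot be hardcoded, one hedged sketch is to melt all dynamic columns into rows by building a map and exploding it; the static column list follows the description above, everything else is illustrative:

```python
from itertools import chain
from pyspark.sql import functions as F

static_cols = ["id", "date", "val", "num"]
dynamic_cols = [c for c in df.columns if c not in static_cols]  # df assumed to exist

# Build a map {column name -> value} over the dynamic columns and explode it
# into (name, value) rows, keeping the static columns alongside.
kv = F.create_map(*chain.from_iterable((F.lit(c), F.col(c)) for c in dynamic_cols))
long_df = df.select(*static_cols, F.explode(kv).alias("name", "value"))
```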
Calculate the minimum distance to destinations for each origin in pyspark
I have a list of origins and destinations along with their geo coordinates. I need to calculate the minimum distance from each origin to the destinations. Below is my code: I got an error like the one below. My question is: it seems that there is something wrong with withColumn('Distance', haversine_vector(F.col('Origin_Geo'), F.col('Destination_Geo'))). I do not know why. (I'm new to pyspark..) I have
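One hedged alternative that avoids a UDF entirely is to cross-join origins with destinations, compute the haversine distance with built-in functions, and take the minimum per origin; the column names (origin_id, orig_lat/orig_lon, dest_lat/dest_lon) are assumptions about the schema:

```python
from pyspark.sql import functions as F

R = 6371.0  # Earth radius in km

pairs = origins.crossJoin(destinations)  # origins and destinations assumed to exist

dist = (
    pairs
    .withColumn("dlat", F.radians(F.col("dest_lat") - F.col("orig_lat")))
    .withColumn("dlon", F.radians(F.col("dest_lon") - F.col("orig_lon")))
    .withColumn(
        "a",
        F.pow(F.sin(F.col("dlat") / 2), 2)
        + F.cos(F.radians("orig_lat")) * F.cos(F.radians("dest_lat"))
        * F.pow(F.sin(F.col("dlon") / 2), 2),
    )
    .withColumn("distance_km", 2 * R * F.asin(F.sqrt("a")))
)

# Minimum distance to any destination, per origin.
min_dist = dist.groupBy("origin_id").agg(F.min("distance_km").alias("min_distance_km"))
```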
Converting pandas dataframe to PySpark dataframe drops index
I've got a pandas dataframe called data_clean. It looks like this: I want to convert it to a Spark dataframe, so I use the createDataFrame() method: sparkDF = spark.createDataFrame(data_clean) However, that seems to drop the index column (the one that has the names ali, anthony, bill, etc.) from the original dataframe. The output of the converted dataframe is: The docs say createDataFrame() can
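A minimal sketch, assuming the names live in an unnamed pandas index: move the index into a regular column with reset_index() before handing the frame to Spark (the new column name is illustrative):

```python
# reset_index() turns the index into an ordinary column; createDataFrame() then keeps it.
pdf = data_clean.reset_index().rename(columns={"index": "name"})
sparkDF = spark.createDataFrame(pdf)
```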
Sum value between overlapping interval slices per group
I have a PySpark dataframe as below: And I want to sum only the consumption on overlapping interval slices per idx: Answer You can use sequence to expand the intervals into single days, explode the list of days, and then sum the consumption for each timestamp and idx: Output: Remarks: sequence includes the last value of the interval, so one day
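A hedged sketch of the approach the answer describes; the column names (idx, start_date, end_date, consumption) are assumptions about the schema:

```python
from pyspark.sql import functions as F

# Expand each interval into one row per day, then aggregate per idx and day.
exploded = df.withColumn(
    "day",
    F.explode(F.sequence("start_date", "end_date", F.expr("interval 1 day"))),
)

result = exploded.groupBy("idx", "day").agg(F.sum("consumption").alias("consumption"))
```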
New column comparing dates in PySpark
I am struggling to create a new column based on a simple condition comparing two dates. I have tried the following, which yields a syntax error. I have also updated it as follows: But this yields a Python error that the Column is not callable. How would I create a new column that dynamically adjusts based on whether the date comparator
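The usual pattern for this is when/otherwise; a minimal sketch with illustrative column names and labels:

```python
from pyspark.sql import functions as F

# New column whose value depends on how the two dates compare.
df = df.withColumn(
    "flag",
    F.when(F.col("start_date") <= F.col("end_date"), F.lit("on_time"))
     .otherwise(F.lit("late")),
)
```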
How can I turn off rounding in Spark?
I have a dataframe and I'm doing this: I want to get just the first four digits after the decimal point, without rounding. When I cast to DecimalType with .cast(DataTypes.createDecimalType(20,4)), or even with the round function, the number is rounded to 0.4220. The only way that I found without rounding is applying the function format_number(), but this function gives me a string,
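One hedged sketch that truncates rather than rounds is to scale, floor, and scale back; the column name is illustrative, and note that floor-based truncation behaves differently for negative values:

```python
from pyspark.sql import functions as F

# Keep the first four decimal places without rounding (for non-negative values).
df = df.withColumn("truncated", F.floor(F.col("value") * 10000) / 10000)
```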
Pyspark get top two values in column from a group based on ordering
I am trying to get the first two counts that appear in this list, by the earliest log_date on which they appeared. In this case, my expected output is: This is what I have working, but there are a few edge cases where the count could go down and then back up, as shown in the example above. This code returns 2021-07-11 as the
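A hedged sketch of one way to get the first two distinct counts per group by the date they first appeared; the grouping and column names (group_id, count, log_date) are assumptions:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Earliest date each distinct count was seen, per group.
first_seen = (
    df.groupBy("group_id", "count")
      .agg(F.min("log_date").alias("first_log_date"))
)

# Keep the two counts that appeared earliest in each group.
w = Window.partitionBy("group_id").orderBy("first_log_date")
top_two = (
    first_seen
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") <= 2)
    .drop("rn")
)
```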
Pivoting DataFrame with fixed column names
Let's say I have the dataframe below, and by design each user has 3 rows. I want to turn my DataFrame into: I was trying groupBy(col('user')) and then pivoting by ticker, but that returns as many columns as there are distinct tickers, whereas I want a fixed number of columns. Is there any other Spark operator I
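Since each user has exactly 3 rows, one hedged sketch is to number the rows within each user and pivot on that index so the output always has the same three columns; the ordering column and names are assumptions:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Assign a stable position 1..3 within each user, then pivot on the position.
w = Window.partitionBy("user").orderBy("ticker")
result = (
    df.withColumn("pos", F.row_number().over(w))
      .groupBy("user")
      .pivot("pos", [1, 2, 3])
      .agg(F.first("ticker"))
)
```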