In pandas, if we have a time series and need to group it by a certain frequency (say, every two weeks), it’s possible to use the Grouper class, like this: Is there any equivalent in Spark (more specifically, using Scala) for this feature? Answer: You can use the SQL function window. First, you create the timestamp column, if you don't
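That answer points at the SQL window function; below is a minimal PySpark sketch of the same idea (the Scala API is analogous), assuming a hypothetical timestamp column ts and a numeric column amount:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("two-week-buckets").getOrCreate()

# hypothetical input: events with a timestamp column `ts` and a numeric `amount`
events = spark.createDataFrame(
    [("2021-07-01 10:00:00", 1.0), ("2021-07-20 12:00:00", 2.0)],
    ["ts", "amount"],
).withColumn("ts", F.to_timestamp("ts"))

# group into fixed two-week buckets, roughly what pandas' Grouper(freq="2W") does
biweekly = events.groupBy(F.window("ts", "2 weeks")).agg(F.sum("amount").alias("total"))
biweekly.show(truncate=False)
```

Note that Spark anchors these buckets to the epoch rather than to calendar week boundaries, so they will not line up exactly with pandas' calendar-based frequencies.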
Tag: apache-spark
pyspark structured streaming kafka – py4j.protocol.Py4JJavaError: An error occurred while calling o41.save
I have a simple PySpark program which publishes data into Kafka. When I do a spark-submit, it gives an error. Command being run: Error: Spark version – 3.2.0; I have Confluent Kafka installed on my machine, here is the version: Here is the code: Any ideas what the issue is? The Spark version seems to be matching
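The underlying Java error is not shown in the excerpt, but a very common cause of o41.save failing when writing to Kafka is a missing spark-sql-kafka connector on the classpath. A hedged sketch, assuming Spark 3.2.0 built for Scala 2.12 and a hypothetical broker and topic:

```python
from pyspark.sql import SparkSession

# pull in the Kafka connector; the version must match the Spark/Scala build (assumed 3.2.0 / 2.12)
spark = (SparkSession.builder
         .appName("kafka-sink")
         .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0")
         .getOrCreate())

df = spark.createDataFrame([("k1", "hello"), ("k2", "world")], ["key", "value"])

# the Kafka sink expects string/binary key and value columns
(df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
   .write.format("kafka")
   .option("kafka.bootstrap.servers", "localhost:9092")   # hypothetical broker
   .option("topic", "events")                             # hypothetical topic
   .save())
```

The same connector can instead be supplied on the command line with --packages on spark-submit.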
Spark: How to flatten nested arrays with different shapes
How to flatten nested arrays with different shapes in PySpark? The question How to flatten nested arrays by merging values in Spark is answered for same-shape arrays; I’m getting the errors described below for arrays with different shapes. Data structure: static names: id, date, val, num (can be hardcoded); dynamic names: name_1_a, name_10000_xvz (cannot be hardcoded, as the data frame has
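The exact schema is cut off in the excerpt, but one generic approach to many dynamically named array-of-struct columns is to explode each one separately and union the pieces. A sketch under those assumptions (the df variable, column names, and struct fields are all hypothetical):

```python
from functools import reduce
from pyspark.sql import DataFrame, functions as F

# hypothetical schema: id plus many columns name_* of type array<struct<...>>
name_cols = [c for c in df.columns if c.startswith("name_")]

# explode each dynamic column on its own, tag it with the source column name,
# then union the results; allowMissingColumns tolerates differing struct shapes
parts = [
    df.select("id", F.lit(c).alias("name"), F.explode(c).alias("x"))
      .select("id", "name", "x.*")
    for c in name_cols
]
flat = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), parts)
```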
Pyspark: How to flatten nested arrays by merging values in spark
I have 10000 JSONs with different ids, each of which has 10000 names. How do I flatten nested arrays by merging values by int or str in PySpark? EDIT: I have added the column name_10000_xvz to better explain the data structure. I have updated the notes, input df, required output df, and input JSON files as well. Notes: the input dataframe has more than 10000 columns name_1_a,
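Since this variant revolves around thousands of name_* columns, another sketch (separate from the explode-and-union one above) is to unpivot them with stack() so they become rows; this assumes the name_* columns share one compatible type:

```python
from pyspark.sql import functions as F

# hypothetical: id plus thousands of name_* columns of one compatible type
name_cols = [c for c in df.columns if c.startswith("name_")]

stack_expr = "stack({n}, {pairs}) as (name, value)".format(
    n=len(name_cols),
    pairs=", ".join("'{c}', `{c}`".format(c=c) for c in name_cols),
)

unpivoted = df.selectExpr("id", stack_expr)
```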
Converting pandas dataframe to PySpark dataframe drops index
I’ve got a pandas dataframe called data_clean. It looks like this: I want to convert it to a Spark dataframe, so I use the createDataFrame() method: sparkDF = spark.createDataFrame(data_clean) However, that seems to drop the index column (the one that has the names ali, anthony, bill, etc.) from the original dataframe. The output of is The docs say createDataFrame() can
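createDataFrame() does ignore the pandas index, so the usual workaround is to promote the index to a regular column first; a minimal sketch, assuming a stand-in for data_clean with the names in its index:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical stand-in for data_clean: the names live in the index, not in a column
data_clean = pd.DataFrame({"score": [1, 2, 3]}, index=["ali", "anthony", "bill"])

# reset_index() turns the index into an ordinary column so Spark keeps it
sparkDF = spark.createDataFrame(
    data_clean.reset_index().rename(columns={"index": "name"})
)
sparkDF.show()
```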
New column comparing dates in PySpark
I am struggling to create a new column based on a simple condition comparing two dates. I have tried the following: which yields a syntax error. I have also updated it as follows: but this yields a Python error that the Column is not callable. How would I create a new column that dynamically adjusts based on whether the date comparator
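The original expressions are not shown, but the idiomatic way to branch on a column comparison is when/otherwise rather than a Python if; a sketch with hypothetical columns start_date and end_date:

```python
from pyspark.sql import functions as F

# hypothetical columns: start_date and end_date, both date/timestamp typed
df2 = df.withColumn(
    "status",
    F.when(F.col("end_date") >= F.col("start_date"), F.lit("on_time"))
     .otherwise(F.lit("late")),
)
```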
How can I turn off rounding in Spark?
I have a dataframe and I’m doing this: I want to get just the first four digits after the decimal point, without rounding. When I cast to DecimalType with .cast(DataTypes.createDecimalType(20, 4)), or even with the round function, this number is rounded to 0.4220. The only way I found without rounding is applying the function format_number(), but this function gives me a string,
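One way to truncate rather than round (a sketch, not the asker's code, and assuming non-negative values, since floor behaves differently for negatives) is to scale, floor, and scale back:

```python
from pyspark.sql import functions as F

# hypothetical column `x`; keep the first four digits after the decimal point
# without rounding, e.g. 0.42209 -> 0.4220 rather than 0.4221
truncated = df.withColumn("x_trunc", F.floor(F.col("x") * 10000) / 10000)
```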
Pyspark get top two values in column from a group based on ordering
I am trying to get the first two counts that appear in this list, by the earliest log_date on which they appeared. In this case my expected output is: This is what I have working, but there are a few edge cases where the count could go down and then back up, as shown in the example above. This code returns 2021-07-11 as the
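Without the full sample data, a common pattern for "first two distinct values by when they first appeared" is to take the earliest log_date per count and then rank; a sketch with a hypothetical grouping column group_id:

```python
from pyspark.sql import functions as F, Window as W

# hypothetical columns: group_id, count, log_date
first_seen = (df.groupBy("group_id", "count")
                .agg(F.min("log_date").alias("first_seen")))

w = W.partitionBy("group_id").orderBy("first_seen")
top_two = (first_seen.withColumn("rn", F.row_number().over(w))
                     .filter(F.col("rn") <= 2)
                     .drop("rn"))
```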
Pivoting DataFrame with fixed column names
Let’s say I have the dataframe below: and by design each user has 3 rows. I want to turn my DataFrame into: I was trying to groupBy(col('user')) and then pivot by ticker, but it returns as many columns as there are different tickers, so instead I wish I could have a fixed number of columns. Is there any other Spark operator I
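One hedged approach (the target layout is cut off in the excerpt) is to number each user's rows and pivot on that position, which yields a fixed set of columns no matter how many distinct tickers exist; the column names here are hypothetical:

```python
from pyspark.sql import functions as F, Window as W

# hypothetical columns: user, ticker; each user has exactly 3 rows by design
w = W.partitionBy("user").orderBy("ticker")

fixed = (df.withColumn("pos", F.row_number().over(w))
           .groupBy("user")
           .pivot("pos", [1, 2, 3])        # explicit value list => fixed columns 1, 2, 3
           .agg(F.first("ticker")))
```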
PySpark Incremental Count on Condition
Given a Spark dataframe with the following columns, I am trying to construct an incremental/running count for each id based on when the contents of the event column evaluate to True. Here a new column called results would be created that contains the incremental count. I’ve tried using window functions but am stumped at this point. Ideally, the solution would
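A window function does cover this case: a running sum of the event flag per id, ordered by some time column (assumed here to be a hypothetical timestamp):

```python
from pyspark.sql import functions as F, Window as W

# hypothetical columns: id, timestamp, event (boolean)
w = (W.partitionBy("id")
       .orderBy("timestamp")
       .rowsBetween(W.unboundedPreceding, W.currentRow))

df2 = df.withColumn("results", F.sum(F.col("event").cast("int")).over(w))
```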