Given the following input dataframe, a dataframe which looks like this needs to be constructed. The input dataframe has tens of millions of records. Some details seen in the example above (by design): npos is the size of the vector to be constructed in the output; pos is guaranteed to be in [0, npos) at each time step (elap).
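The example data is elided above, so the following is only a rough sketch under assumptions: one input row per time step with hypothetical columns elap, pos, npos, and val, and the output vector built with val placed at index pos and zeros elsewhere.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: one row per time step (elap) with a position, the
# output vector size, and a value to place at that position.
df = spark.createDataFrame(
    [(0, 1, 4, 7.0), (1, 3, 4, 2.0)],
    ["elap", "pos", "npos", "val"],
)

# Build a vector of length npos with val at index pos and 0.0 elsewhere.
out = df.withColumn(
    "vec",
    F.expr("transform(sequence(0, npos - 1), i -> IF(i = pos, val, CAST(0.0 AS DOUBLE)))"),
)
out.show(truncate=False)
```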
Tag: pyspark
Create column from array of struct in PySpark
I’m pretty new to data processing. I have a deeply nested dataset that has approximately this schema: For the array, I will receive something like this. Keep in mind that the length is variable; I might receive no value, or 10, or even more. Is there a way to transform the schema to: with VAT and fiscal1
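The nested schema is elided above, but assuming the array is made of {type, value} structs, a sketch of one way to lift VAT and fiscal1 out as columns is map_from_entries followed by key lookups; the field names type, value, and registrations are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical shape: a variable-length array of {type, value} structs per row.
df = spark.createDataFrame(
    [(1, [("VAT", "BE0123456789"), ("fiscal1", "999")]), (2, [])],
    "id INT, registrations ARRAY<STRUCT<type: STRING, value: STRING>>",
)

# Turn the array into a map keyed by `type`, then lift the keys we care
# about into their own columns.
result = (
    df.withColumn("reg_map", F.map_from_entries("registrations"))
      .select(
          "id",
          F.col("reg_map")["VAT"].alias("VAT"),
          F.col("reg_map")["fiscal1"].alias("fiscal1"),
      )
)
result.show()
```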
How to filter multiple rows based on row and column conditions in PySpark
I want to filter multiple rows based on the “value” column. For example, I want to filter velocity from the channel_name column where value >= 1 and value <= 5, and I want to filter Temp from the channel_name column where value >= 0 and value <= 2. Below is my PySpark DF.

start_timestamp        channel_name   value
2020-11-02 08:51:50    velocity       1
2020-11-02 09:14:29    Temp           0
2020-11-02 09:18:32    velocity       0
2020-11-02 09:32:42    velocity       4
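A sketch of one way to express this, reusing the sample rows above and combining the two per-channel conditions with a logical OR:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ("2020-11-02 08:51:50", "velocity", 1),
        ("2020-11-02 09:14:29", "Temp", 0),
        ("2020-11-02 09:18:32", "velocity", 0),
        ("2020-11-02 09:32:42", "velocity", 4),
    ],
    ["start_timestamp", "channel_name", "value"],
)

# Keep velocity rows with 1 <= value <= 5 and Temp rows with 0 <= value <= 2.
filtered = df.filter(
    ((F.col("channel_name") == "velocity") & F.col("value").between(1, 5))
    | ((F.col("channel_name") == "Temp") & F.col("value").between(0, 2))
)
filtered.show()
```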
PySpark – How to define MapType for when/otherwise
I have a PySpark DataFrame with a MapType column that either contains a map<string, int> or is None. I need to perform some calculations using collect_list, but collect_list excludes None values, and I am trying to find a workaround by transforming None to a string, similar to “Include null values in collect_list in pyspark”. However, in my case I can’t
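One possible workaround, sketched with hypothetical columns grp and counts: coalesce NULL maps to a placeholder entry so collect_list keeps one element per input row instead of dropping the NULLs.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType, MapType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows: a grouping key and a map<string,int> column that may be NULL.
schema = StructType([
    StructField("grp", StringType()),
    StructField("counts", MapType(StringType(), IntegerType())),
])
df = spark.createDataFrame([("a", {"x": 1}), ("a", None)], schema)

# Replace NULL maps with a placeholder entry so collect_list keeps one
# element per input row instead of silently dropping the NULLs.
df = df.withColumn("counts", F.coalesce("counts", F.expr("map('__none__', -1)")))
result = df.groupBy("grp").agg(F.collect_list("counts").alias("all_counts"))
result.show(truncate=False)
```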
PySpark Structured Streaming Kafka – py4j.protocol.Py4JJavaError: An error occurred while calling o41.save
I have a simple PySpark program which publishes data into Kafka. When I do a spark-submit, it gives an error. Command being run: Error: Spark version – 3.2.0. I have Confluent Kafka installed on my machine; here is the version: Here is the code: Any ideas what the issue is? The Spark version seems to be matching.
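The command and error text are elided above, but a frequent cause of o41.save failing on a Kafka sink is that the spark-sql-kafka connector is not on the classpath. A minimal batch-publish sketch, assuming a local broker and a hypothetical topic name:

```python
# A frequent fix: ship the Kafka connector with the job, e.g.
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 publish.py
# (artifact version chosen to match Spark 3.2.0 / Scala 2.12 -- adjust to your build)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-publish").getOrCreate()

df = spark.createDataFrame([("k1", "hello"), ("k2", "world")], ["key", "value"])

# Kafka expects key/value as strings or binary.
(df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
   .write
   .format("kafka")
   .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
   .option("topic", "test-topic")                        # hypothetical topic name
   .save())
```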
Splitting object data into new columns in dataframe
I have a dataframe with columns business_id and attributes, with thousands of rows like this: How do I create a new column for each attribute, holding its value for that business_id? If an attribute is not applicable to a business_id, it should specify False. Example: Note also that some attributes have a value that is an object within an object.
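The sample rows are elided above; assuming attributes is a JSON string with hypothetical keys such as WiFi and a nested Ambience object, one sketch is to parse it with from_json and coalesce missing keys to False:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows: attributes stored as a JSON string, sometimes nested.
df = spark.createDataFrame(
    [
        ("b1", '{"WiFi": true, "Ambience": {"romantic": false, "classy": true}}'),
        ("b2", '{"WiFi": false}'),
    ],
    ["business_id", "attributes"],
)

# Parse the JSON once, then promote each key to a column, defaulting to
# False when an attribute is not present for a business.
parsed = df.withColumn(
    "attrs",
    F.from_json(
        "attributes",
        "WiFi BOOLEAN, Ambience STRUCT<romantic: BOOLEAN, classy: BOOLEAN>",
    ),
)
result = parsed.select(
    "business_id",
    F.coalesce("attrs.WiFi", F.lit(False)).alias("WiFi"),
    F.coalesce("attrs.Ambience.romantic", F.lit(False)).alias("Ambience_romantic"),
    F.coalesce("attrs.Ambience.classy", F.lit(False)).alias("Ambience_classy"),
)
result.show()
```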
Using pyspark.sql.functions without sparkContext import problem
I have a situation which can be reduced to an example with two files: filters.py and main.py. It appears that an F.col object cannot be created without an active SparkSession/SparkContext, so the import fails. Is there any way to keep the filters separated from other files, and how can I import them? My situation is a little more complicated; these filters are used in many
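One common workaround is to defer Column creation until a SparkSession exists, e.g. by wrapping each filter in a function. A sketch with a hypothetical is_active filter (the two files are shown together here as comments):

```python
# filters.py -- wrap each filter in a function so F.col(...) is only
# evaluated after a SparkSession (and its JVM gateway) exists.
from pyspark.sql import functions as F

def is_active():  # hypothetical filter name
    return F.col("status") == "active"

# main.py
from pyspark.sql import SparkSession
# from filters import is_active

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "active"), (2, "inactive")], ["id", "status"])
df.filter(is_active()).show()
```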
PySpark UDF to explode records based on date range
I am a Noob in Python & Pyspark. I need to explode a row of patient into yearly dates, such that each patient has 1 row per year. I wrote a python function (below), and registered it as pyspark UDF (having read many articles here). My problem is that when I apply it on my pyspark dataframe, it fails. My
Create array of differences between adjacent numbers in an array column in Python/PySpark
I have a column of arrays of numbers, e.g. [0, 80, 160, 220], and would like to create a column of arrays of the differences between adjacent terms, e.g. [80, 80, 60]. Does anyone have an idea how to approach this in Python or PySpark? I’m thinking of something iterative (the i-th term minus the (i-1)-th term, starting at the second term) but am really stuck how
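One non-iterative option (Spark 2.4+) is to zip the array with a shifted copy of itself and subtract; a sketch using the example array above, which yields [80, 80, 60]:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([([0, 80, 160, 220],)], ["arr"])

# Pair each element (from the 2nd onward) with its predecessor and subtract.
diffs = df.withColumn(
    "diffs",
    F.expr(
        "zip_with(slice(arr, 2, size(arr) - 1), "
        "slice(arr, 1, size(arr) - 1), (x, y) -> x - y)"
    ),
)
diffs.show(truncate=False)
```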
Extract value from a list of JSON in PySpark
I have a dataframe where a column is in the form of a list of JSON. I want to extract a specific value (score) from the column and create independent columns. I want to explode my result dataframe as: Answer: Assuming your JSON looks like this, you can read it, flatten it, then pivot it like so
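The JSON and the answer’s code are elided above; here is a sketch of the read–flatten–pivot approach it describes, assuming a hypothetical list of {name, score} objects stored as a string:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical column holding a list of JSON objects as a string.
df = spark.createDataFrame(
    [(1, '[{"name": "fraud", "score": 0.9}, {"name": "spam", "score": 0.2}]')],
    ["id", "scores_json"],
)

# Parse the string into an array of structs, explode it, then pivot the
# names into independent score columns.
parsed = df.withColumn(
    "scores",
    F.from_json("scores_json", "ARRAY<STRUCT<name: STRING, score: DOUBLE>>"),
)
flat = parsed.select(
    "id",
    F.explode("scores").alias("s"),
).select("id", F.col("s.name").alias("name"), F.col("s.score").alias("score"))

result = flat.groupBy("id").pivot("name").agg(F.first("score"))
result.show()
```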