I’m pretty new to data processing. I have a deeply nested dataset that has approximately this schema: For the array, I will receive something like this. Keep in mind that the length is variable; I might receive no values, or 10, or even more. Is there a way to transform the schema to: with VAT and fiscal1
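No answer excerpt is included here, so the following is only a minimal sketch of one common approach: explode the variable-length array of structs and pivot the known codes (VAT, fiscal1) into columns. The column names taxes, code, and amount are assumptions, since the actual schema is not shown.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed input: each row carries a variable-length array of (code, amount) structs
df = spark.createDataFrame(
    [("row1", [("VAT", 20.0), ("fiscal1", 3.5)]), ("row2", [])],
    "id string, taxes array<struct<code:string, amount:double>>",
)

# Explode the array (explode_outer keeps rows whose array is empty),
# then pivot the known codes into their own columns
flat = (
    df.select("id", F.explode_outer("taxes").alias("tax"))
      .select("id", F.col("tax.code").alias("code"), F.col("tax.amount").alias("amount"))
      .groupBy("id")
      .pivot("code", ["VAT", "fiscal1"])
      .agg(F.first("amount"))
)
flat.show()
```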
Tag: apache-spark-sql
How to filter multiple rows based on row and column conditions in PySpark
I want to filter multiple rows based on the “value” column. For example, I want to filter velocity from the channel_name column where value >= 1 and value <= 5, and I want to filter Temp from the channel_name column where value >= 0 and value <= 2. Below is my PySpark DF.

start_timestamp        channel_name   value
2020-11-02 08:51:50    velocity       1
2020-11-02 09:14:29    Temp           0
2020-11-02 09:18:32    velocity       0
2020-11-02 09:32:42    velocity       4
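A minimal sketch of one way to express this filter, using the column names shown in the DataFrame above; the two per-channel ranges are combined with & and |:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ("2020-11-02 08:51:50", "velocity", 1),
        ("2020-11-02 09:14:29", "Temp", 0),
        ("2020-11-02 09:18:32", "velocity", 0),
        ("2020-11-02 09:32:42", "velocity", 4),
    ],
    ["start_timestamp", "channel_name", "value"],
)

# Keep velocity rows with 1 <= value <= 5 and Temp rows with 0 <= value <= 2
filtered = df.filter(
    ((F.col("channel_name") == "velocity") & F.col("value").between(1, 5))
    | ((F.col("channel_name") == "Temp") & F.col("value").between(0, 2))
)
filtered.show()
```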
PySpark UDF to explode records based on date range
I am a noob in Python and PySpark. I need to explode a row per patient into yearly dates, such that each patient has one row per year. I wrote a Python function (below) and registered it as a PySpark UDF (having read many articles here). My problem is that when I apply it to my PySpark DataFrame, it fails. My
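The failing UDF itself is not shown, so here is a hedged alternative sketch that avoids a Python UDF entirely: sequence() plus explode() (Spark 2.4+) generates one row per year. The columns patient_id, start_date, and end_date are assumed names, since the question's schema is not included.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema; the question's exact columns are not shown
df = spark.createDataFrame(
    [("p1", "2017-03-15", "2020-06-30")],
    ["patient_id", "start_date", "end_date"],
)

# One output row per calendar year covered by each patient's date range,
# generated with sequence() + explode() instead of a Python UDF (Spark 2.4+)
yearly = df.withColumn(
    "year",
    F.explode(
        F.sequence(F.year(F.to_date("start_date")), F.year(F.to_date("end_date")))
    ),
)
yearly.show()
```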
Spark: How to flatten nested arrays with different shapes
How to flatten nested arrays with different shapes in PySpark? How to flatten nested arrays by merging values in Spark is answered here for arrays with the same shape. I’m getting the errors described below for arrays with different shapes. Data structure: Static names: id, date, val, num (can be hardcoded). Dynamic names: name_1_a, name_10000_xvz (cannot be hardcoded, as the data frame has
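The full schema is truncated above, so the following is only a generic sketch of the usual approach: repeatedly promote struct fields to top-level columns and explode array columns until the schema is flat. It keys off data types rather than the dynamic name_* columns, so nothing needs to be hardcoded; the static columns (id, date, val, num) pass through untouched. Note that exploding several sibling arrays this way produces a cross product of their elements.

```python
from pyspark.sql import DataFrame, functions as F
from pyspark.sql.types import ArrayType, StructType

def flatten(df: DataFrame) -> DataFrame:
    """Repeatedly expand struct fields and explode array columns until the schema is flat."""
    while True:
        struct_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StructType)]
        if struct_cols:
            # Promote one level of struct fields to top-level columns (parent_child names)
            df = df.select(
                *[c for c in df.columns if c not in struct_cols],
                *[
                    F.col(f"`{c}`.`{sub}`").alias(f"{c}_{sub}")
                    for c in struct_cols
                    for sub in df.schema[c].dataType.names
                ],
            )
            continue
        array_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, ArrayType)]
        if array_cols:
            # Explode one array at a time; sibling arrays end up as a cross product
            df = df.withColumn(array_cols[0], F.explode_outer(array_cols[0]))
            continue
        return df

# Usage: flat_df = flatten(nested_df)
```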
PySpark: How to flatten nested arrays by merging values in Spark
I have 10000 JSONs with different ids, each of which has 10000 names. How do I flatten nested arrays by merging values by int or str in PySpark? EDIT: I have added the column name_10000_xvz to explain the data structure better. I have updated the Notes, Input df, required output df, and input JSON files as well. Notes: The input dataframe has more than 10000 columns name_1_a,
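The input and required output are not reproduced here, so this is only a hedged sketch of one interpretation: melt the dynamic name_* columns into (name, entries) rows and then explode each array, which merges everything into one long table keyed by id and name. All sample values and the struct fields date/val are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: one array-of-structs column per dynamic name_* field
df = spark.createDataFrame(
    [("id1", [("2021-01-01", 1)], [("2021-01-01", 9)])],
    "id string, "
    "name_1_a array<struct<date:string, val:int>>, "
    "name_10000_xvz array<struct<date:string, val:int>>",
)

name_cols = [c for c in df.columns if c.startswith("name_")]

# Stack the dynamic columns into (name, entries) pairs, then explode the arrays,
# giving one long table keyed by id and name instead of 10000+ wide columns
long_df = (
    df.select(
        "id",
        F.explode(
            F.array(*[
                F.struct(F.lit(c).alias("name"), F.col(c).alias("entries"))
                for c in name_cols
            ])
        ).alias("s"),
    )
    .select("id", "s.name", F.explode_outer("s.entries").alias("e"))
    .select("id", "name", "e.date", "e.val")
)
long_df.show()
```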
How to create a new table with the first name only
I have some data that looks like this: I’d like to create a new table with the name column, but with the first name only. Answer: This gets the first substring before the space character in name as first_name.

first_name
Arizona
Emerald
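The answer's description maps to substring_index (or an equivalent split) in PySpark; the sample rows below are made up, since the original full names are not shown.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows, consistent with the answer's first_name output
df = spark.createDataFrame(
    [("Arizona Robbins",), ("Emerald Stone",)],
    ["name"],
)

# Everything before the first space in `name`, exposed as first_name
new_table = df.select(F.substring_index("name", " ", 1).alias("first_name"))
new_table.show()
```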
Combining WHEN and aggregation functions
I need to convert this PySpark SQL code sample: into pure DataFrame code, without SQL expressions. I tried: TypeError: condition should be a Column But obviously, it’s not working. What am I doing wrong? Any suggestion will be appreciated! Answer: Use isNull to check, not is None:
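Neither the SQL sample nor the attempted DataFrame code is reproduced above, so this is just a minimal sketch of the pattern the answer describes, combined with an aggregation as in the question's title; the column and alias names are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", None), ("a", 3), ("b", 5)],
    "key string, amount int",
)

# `F.col("amount") is None` yields a plain Python bool, which is why when()
# raises "condition should be a Column"; Column.isNull() builds a Column instead.
result = df.groupBy("key").agg(
    F.sum(F.when(F.col("amount").isNull(), 1).otherwise(0)).alias("null_count")
)
result.show()
```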
PySpark: write a function to count non-zero values of given columns
I want to have a function that takes column names and grouping conditions as input and, based on those, returns the count of non-zero values for each column. Something like this, but including the non-zero condition as well. Answer: You can use a list comprehension to generate the list of aggregation expressions:
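The original function skeleton is not shown, so this is a hedged sketch of the approach the answer describes: build the aggregation expressions with a list comprehension, counting only rows where the column is non-zero. The helper name and parameters are made up.

```python
from pyspark.sql import DataFrame, functions as F

def count_nonzero(df: DataFrame, group_cols, value_cols) -> DataFrame:
    # One aggregation expression per requested column: count() ignores the nulls
    # produced by when() for zero (or null) values, so only non-zeros are counted.
    aggs = [
        F.count(F.when(F.col(c) != 0, c)).alias(f"{c}_nonzero")
        for c in value_cols
    ]
    return df.groupBy(*group_cols).agg(*aggs)

# Example: non-zero counts of columns a and b per id
# count_nonzero(df, ["id"], ["a", "b"]).show()
```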
How to fill in null values in PySpark
I have a df that will join a calendar date df. Next step: I am populating the date range between the first and last date. Step 2: let’s say this is the calendar df that has id and calendar dates, and I want to join on calendar dates. I would like to fill in all those null values based on the first non-null
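The question is cut off above, but filling null values from the nearest non-null after joining to a calendar is usually done with a window fill. Below is a minimal sketch using last(..., ignorenulls=True) over an ordered window; the column names id, calendar_date, and value are assumptions.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical calendar-joined frame with gaps to fill
df = spark.createDataFrame(
    [(1, "2021-01-01", 10), (1, "2021-01-02", None), (1, "2021-01-03", None)],
    "id int, calendar_date string, value int",
)

# Forward fill: carry the most recent non-null value down each id's date-ordered rows
w = (
    Window.partitionBy("id")
          .orderBy("calendar_date")
          .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
filled = df.withColumn("value", F.last("value", ignorenulls=True).over(w))
filled.show()

# For a backward fill (take the first non-null that follows), use
# F.first("value", ignorenulls=True) over Window.currentRow .. Window.unboundedFollowing.
```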
Converting Python code to PySpark code
The code below is in Python and I want to convert it to PySpark; basically, I’m not sure what the code will be for the statement pd.read_sql(query, connect_to_hive) when converted to PySpark. I need to extract data from the EDL, so I’m making the connection to the EDL using pyodbc and then extracting the data using a SQL query. pyodbc connection to the
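The pyodbc snippet is truncated above, so this is only a sketch of the usual replacement: if the EDL tables are registered in the Hive metastore that Spark is configured against, spark.sql(query) plays the role of pd.read_sql(query, connect_to_hive) and returns a Spark DataFrame directly, with no pyodbc connection. The database/table names and the JDBC URL below are placeholders.

```python
from pyspark.sql import SparkSession

# Hive-enabled session: spark.sql() reads directly from tables in the metastore
spark = (
    SparkSession.builder
    .appName("edl-extract")
    .enableHiveSupport()
    .getOrCreate()
)

query = "SELECT * FROM some_db.some_table"   # placeholder query
df = spark.sql(query)
df.show()

# Alternative when only a JDBC endpoint is available (URL and driver are placeholders):
# df = (spark.read.format("jdbc")
#       .option("url", "jdbc:hive2://host:10000/default")
#       .option("query", query)
#       .option("driver", "org.apache.hive.jdbc.HiveDriver")
#       .load())
```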