
Spark: How to flatten nested arrays with different shapes

How can I flatten nested arrays with different shapes in PySpark? The question "How to flatten nested arrays by merging values in spark" already covers arrays that share the same shape. For arrays with different shapes, I get the errors described below.

Data-structure:

  • Static names: id, date, val, num (can be hardcoded)
  • Dynamic names: name_1_a, name_10000_xvz (cannot be hardcoded, as the data frame has up to 10000 columns/arrays)

Input df:

JavaScript

Required output df:

JavaScript

Code to reproduce:

NOTE: When I add el.num to TRANSFORM({name}, el -> STRUCT("{name}" AS name, el.date, el.val, el.num)), I get the error below.

JavaScript

Output:

JavaScript


Answer

You need to explode each array individually, probably use a UDF to fill in the missing values, and unionAll the newly created dataframes. That's the PySpark part. For the Python part, you just need to loop through the different columns and let the magic happen:

JavaScript

Here is the result:

JavaScript
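The explode-and-union idea above can also be written as plain Spark SQL. The following is a minimal sketch, not the answer's original code: it only builds the query string, and it fills missing fields with NULL instead of a UDF. The helper `build_union_query`, the view name `t`, the `FIELDS` list, and the example shapes are all assumptions made for illustration.

```python
from typing import Dict, List

# Assumed struct fields; the question lists date, val and num as static names.
FIELDS = ["date", "val", "num"]


def build_union_query(view: str, shapes: Dict[str, List[str]]) -> str:
    """Build one exploding SELECT per array column, joined with UNION ALL.

    `shapes` maps each array column to the struct fields it actually has.
    Fields a column lacks are emitted as NULL so every branch has the
    same output columns.
    """
    selects = []
    for name, present in shapes.items():
        cols = [
            f"el.{field} AS {field}" if field in present else f"NULL AS {field}"
            for field in FIELDS
        ]
        selects.append(
            f"SELECT id, '{name}' AS name, {', '.join(cols)} "
            f"FROM {view} LATERAL VIEW explode(`{name}`) x AS el"
        )
    return "\nUNION ALL\n".join(selects)


# Hypothetical shapes: the first array has no `num` field, the second does.
query = build_union_query(
    "t",
    {"name_1_a": ["date", "val"], "name_10000_xvz": ["date", "val", "num"]},
)
print(query)
```

With a live session, the data frame would first be registered with `df.createOrReplaceTempView("t")` and the string passed to `spark.sql(query)`. The shapes do not need to be hardcoded: for an array-of-struct column they can be read from `df.schema[name].dataType.elementType.fieldNames()`.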

Another solution without using unionAll:

JavaScript

And the result:

JavaScript
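One possible way to avoid the union, sketched here under assumptions and not necessarily what the answer's code did, is to pad every array to a common struct inside TRANSFORM, concatenate the arrays, and expand them with `inline`. The helper `build_inline_expr`, the `TARGET` type mapping, and the example shapes are hypothetical; the SQL types in particular must be adjusted to the real schema, because `concat` expects all arrays to have a matching element type.

```python
from typing import Dict, List

# Assumed target fields and Spark SQL types; adjust these to the real schema.
TARGET = {"date": "STRING", "val": "BIGINT", "num": "BIGINT"}


def build_inline_expr(shapes: Dict[str, List[str]]) -> str:
    """Build a single SQL expression that flattens all array columns.

    Each array is rewritten with TRANSFORM so that its structs carry the
    column name plus every TARGET field; fields the column lacks become
    typed NULLs. The padded arrays are concatenated and expanded to rows.
    """
    arrays = []
    for name, present in shapes.items():
        parts = [f"'{name}' AS name"]
        for field, sql_type in TARGET.items():
            value = f"el.{field}" if field in present else f"CAST(NULL AS {sql_type})"
            parts.append(f"{value} AS {field}")
        arrays.append(f"TRANSFORM(`{name}`, el -> STRUCT({', '.join(parts)}))")
    return f"inline(concat({', '.join(arrays)}))"


# Hypothetical shapes: the first array has no `num` field, the second does.
expr = build_inline_expr(
    {"name_1_a": ["date", "val"], "name_10000_xvz": ["date", "val", "num"]}
)
print(expr)
```

The resulting string would be used as `df.selectExpr("id", expr)`. Note that `concat` returns NULL if any of its inputs is NULL, so rows where one array column is missing entirely may need a `coalesce` around each TRANSFORM.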
User contributions licensed under: CC BY-SA