Create column from array of struct Pyspark

I’m pretty new to data processing. I have a deeply nested dataset with approximately this schema:

JavaScript

For the array, I will receive something like this. Keep in mind that the length is variable: I might receive no values, 10, or even more.

JavaScript

Is there a way to transform the schema to:

JavaScript

with the VAT and fiscal1 values being the corresponding registrationNumber values. Basically, I need VAT and fiscal1 as columns, each holding the registrationNumber of the entry with that registrationNumberType.
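
To make the target concrete, here is a minimal plain-Python sketch of the intended mapping for a single record. The helper name registration_columns and the sample values are hypothetical; only the field names registrationNumber and registrationNumberType come from the dataset described above. It assumes each type appears at most once per record (if a type repeats, the last entry wins).

```python
def registration_columns(registration_numbers, wanted=("VAT", "fiscal1")):
    """Map each wanted registrationNumberType to its registrationNumber.

    Types missing from the array map to None; other types are ignored.
    """
    found = {
        entry["registrationNumberType"]: entry["registrationNumber"]
        for entry in registration_numbers
    }
    return {kind: found.get(kind) for kind in wanted}


# Hypothetical sample values, for illustration only.
sample = [
    {"registrationNumber": "123", "registrationNumberType": "VAT"},
    {"registrationNumber": "456", "registrationNumberType": "fiscal1"},
    {"registrationNumber": "789", "registrationNumberType": "other"},
]

print(registration_columns(sample))
print(registration_columns([]))  # an empty array yields None for both columns
```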

Thanks so much

Edit:

Here is a sample JSON of col3:

JavaScript

And here is what I would like to have:

JavaScript

Maybe I could create a new dataframe from the array and the primary key, create the VAT and fiscal1 columns there, and then join the two dataframes on the primary key to fill in the column?

Answer

You can use the inline function to explode the col3.registrationNumbers array and expand its struct elements into columns, then keep only the rows whose registrationNumberType is either VAT or fiscal1, and pivot on that type. After the pivot, update the struct column col3 with the pivoted columns:

JavaScript
User contributions licensed under: CC BY-SA