
Create columns from an array of structs in PySpark

I’m pretty new to data processing. I have a deeply nested dataset with approximately this schema:

 |-- col1 : string
 |-- col2 : string
 |-- col3: struct
 |    |-- something : string
 |    |-- elem: array
 |    |    |-- registrationNumber: struct
 |    |    |     |-- registrationNumber : string
 |    |    |     |-- registrationNumberType : string
 |    |    |     |-- registrationCode : int
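
For reference, that schema could be written out explicitly along these lines (a minimal sketch in PySpark types, assuming the array elements are structs with exactly the three fields shown):

from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, ArrayType)

# Sketch of the schema described above; field names are assumed to match exactly.
schema = StructType([
    StructField("col1", StringType()),
    StructField("col2", StringType()),
    StructField("col3", StructType([
        StructField("something", StringType()),
        StructField("elem", ArrayType(StructType([
            StructField("registrationNumber", StringType()),
            StructField("registrationNumberType", StringType()),
            StructField("registrationCode", IntegerType()),
        ]))),
    ])),
])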

For the array, I will receive something like the following. Keep in mind that the length is variable: I might receive no values, or 10, or even more.

[
  {
    "registrationNumber": "123456789",
    "registrationNumberType": "VAT",
    "registrationCode": 1234
  },
  {
    "registrationNumber": "ABCDERTYU",
    "registrationNumberType": "fiscal1",
    "registrationCode": 9876
  },
  {
    "registrationNumber": "123456789",
    "registrationNumberType": "foo",
    "registrationCode": 8765
  }
]

Is there a way to transform the schema to :

 |-- col1 : string
 |-- col2 : string
 |-- col3: struct
 |    |-- something : string
 |    |-- VAT: string
 |    |-- fiscal1: string 

with the VAT and fiscal1 values being the corresponding registrationNumber values. I basically need to get the VAT and fiscal1 values as columns.

Thanks so much

Edit:

Here is a sample JSON of col3:

{
    "col3": {
        "somestring": "xxxxxx",
        "registrationNumbers": [
            {
                "registrationNumber": "something",
                "registrationNumberType": "VAT"
            },
            {
                "registrationNumber": "somethingelse",
                "registrationNumberType": "fiscal1"
            },
            {
                "registrationNumber": "something i dont need",
                "registrationNumberType": "fiscal2"
            }
        ]
    }
}

and here is what I would like to have:

{
    "col3": {
        "somestring": "xxxxxx",
        "VAT": "something",
        "fiscal1": "somethingelse"
    }
}

Maybe I could create a dataframe from the array and a primary key, create the VAT and fiscal1 columns there, and finally join the two dataframes back together on the primary key?
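
That idea could look roughly like the sketch below (assuming col1 can serve as the primary key; the intermediate names numbers, reg, type and number are made up for illustration):

import pyspark.sql.functions as F

# Explode the array into one row per registration entry, keyed by col1.
numbers = (df
    .select("col1", F.explode("col3.registrationNumbers").alias("reg"))
    .select("col1",
            F.col("reg.registrationNumberType").alias("type"),
            F.col("reg.registrationNumber").alias("number"))
    .filter(F.col("type").isin("VAT", "fiscal1"))
    .groupBy("col1")
    .pivot("type", ["VAT", "fiscal1"])  # one column per registration type
    .agg(F.first("number")))

# Join the VAT / fiscal1 columns back onto the original dataframe.
result = df.join(numbers, on="col1", how="left")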


Answer

You can use the inline function to explode and expand the struct elements of the col3.registrationNumbers array, then filter to the rows whose registrationNumberType is either VAT or fiscal1, and pivot. After the pivot, rebuild the struct column col3 from the pivoted columns:

import pyspark.sql.functions as F

exampleJSON = '{"col1":"col1_XX","col2":"col2_XX","col3":{"somestring":"xxxxxx","registrationNumbers":[{"registrationNumber":"something","registrationNumberType":"VAT"},{"registrationNumber":"somethingelse","registrationNumberType":"fiscal1"},{"registrationNumber":"something i dont need","registrationNumberType":"fiscal2"}]}}'
df = spark.read.json(sc.parallelize([exampleJSON]))

df1 = (df
    .selectExpr("*", "inline(col3.registrationNumbers)")  # one row per struct in the array
    .filter(F.col("registrationNumberType").isin(["VAT", "fiscal1"]))  # keep only the types we need
    .groupBy("col1", "col2", "col3")
    .pivot("registrationNumberType")  # VAT / fiscal1 become columns
    .agg(F.first("registrationNumber"))
    .withColumn("col3", F.struct(F.col("col3.somestring"), F.col("VAT"), F.col("fiscal1")))  # nest them back into col3
    .drop("VAT", "fiscal1"))

df1.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)
# |-- col3: struct (nullable = false)
# |    |-- somestring: string (nullable = true)
# |    |-- VAT: string (nullable = true)
# |    |-- fiscal1: string (nullable = true)

df1.show(truncate=False)
#+-------+-------+----------------------------------+
#|col1   |col2   |col3                              |
#+-------+-------+----------------------------------+
#|col1_XX|col2_XX|{xxxxxx, something, somethingelse}|
#+-------+-------+----------------------------------+
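
As an aside, if you prefer to avoid the groupBy/pivot round trip, the same result can be computed per row with higher-order functions (a sketch assuming Spark 2.4+, reusing df and F from above):

# Build a registrationNumberType -> registrationNumber map per row,
# then pull the wanted keys out of it and rebuild col3.
df2 = (df
    .withColumn("reg_map", F.map_from_entries(
        F.expr("transform(col3.registrationNumbers, "
               "x -> struct(x.registrationNumberType, x.registrationNumber))")))
    .withColumn("col3", F.struct(
        F.col("col3.somestring"),
        F.col("reg_map")["VAT"].alias("VAT"),
        F.col("reg_map")["fiscal1"].alias("fiscal1")))
    .drop("reg_map"))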