How to parse a JSON string of nested lists to a Spark data frame in PySpark?
Input data frame:
+-------------+-----------------------------------------------+
|url          |json                                           |
+-------------+-----------------------------------------------+
|https://url.a|[[1572393600000, 1.000],[1572480000000, 1.007]]|
|https://url.b|[[1572825600000, 1.002],[1572912000000, 1.000]]|
+-------------+-----------------------------------------------+

root
 |-- url: string (nullable = true)
 |-- json: string (nullable = true)
Expected output:
+-------+---------------+-------+
| col_1 | col_2         | col_3 |
+-------+---------------+-------+
| a     | 1572393600000 | 1.000 |
| a     | 1572480000000 | 1.007 |
| b     | 1572825600000 | 1.002 |
| b     | 1572912000000 | 1.000 |
+-------+---------------+-------+
Example code:
import pyspark
import pyspark.sql.functions as F

spark = (pyspark.sql.SparkSession.builder
         .appName("Downloader_standalone")
         .master('local[*]')
         .getOrCreate())
sc = spark.sparkContext

rdd_list = [('https://url.a', '[[1572393600000, 1.000],[1572480000000, 1.007]]'),
            ('https://url.b', '[[1572825600000, 1.002],[1572912000000, 1.000]]')]
jsons = sc.parallelize(rdd_list)
df = spark.createDataFrame(jsons, "url string, json string")
df.show(truncate=False)
df.printSchema()

# This last part fails: "array<string,string>" is not a valid schema string
(df.withColumn('json', F.from_json(F.col('json'), "array<string,string>"))
 .select(F.explode('json').alias('col_1', 'col_2', 'col_3'))
 .show())
There are a few similar examples, but I cannot figure out how to do it:
How to parse and transform json string from spark data frame rows in pyspark
How to transform JSON string with multiple keys, from spark data frame rows in pyspark?
Answer
With some string replacements and by splitting you can get the desired result:
from pyspark.sql import functions as F

df1 = df.withColumn(
    "col_1",
    F.regexp_replace("url", "https://url.", "")
).withColumn(
    "col_2_3",
    F.explode(
        F.expr("""transform(
                    split(trim(both '][' from json), '\],\['),
                    x -> struct(split(x, ',')[0] as col_2, split(x, ',')[1] as col_3)
                  )""")
    )
).selectExpr("col_1", "col_2_3.*")

df1.show(truncate=False)

#+-----+-------------+------+
#|col_1|col_2        |col_3 |
#+-----+-------------+------+
#|a    |1572393600000| 1.000|
#|a    |1572480000000| 1.007|
#|b    |1572825600000| 1.002|
#|b    |1572912000000| 1.000|
#+-----+-------------+------+
Explanation:

`trim(both '][' from json)`: removes the leading and trailing characters `[` and `]`, giving something like `1572393600000, 1.000],[1572480000000, 1.007`.

Now you can split by `],[` (the `\` is for escaping the brackets).

`transform` takes the array from the split and, for each element, splits it by comma and creates a struct with `col_2` and `col_3`.

Finally, `explode` the array of structs you get from the transform and star-expand the struct column.