How to parse a JSON string of nested lists to a Spark data frame in PySpark?
Input data frame:
+-------------+-----------------------------------------------+
|url          |json                                           |
+-------------+-----------------------------------------------+
|https://url.a|[[1572393600000, 1.000],[1572480000000, 1.007]]|
|https://url.b|[[1572825600000, 1.002],[1572912000000, 1.000]]|
+-------------+-----------------------------------------------+

root
 |-- url: string (nullable = true)
 |-- json: string (nullable = true)
Expected output:
+-------+---------------+-------+
| col_1 | col_2         | col_3 |
+-------+---------------+-------+
| a     | 1572393600000 | 1.000 |
| a     | 1572480000000 | 1.007 |
| b     | 1572825600000 | 1.002 |
| b     | 1572912000000 | 1.000 |
+-------+---------------+-------+
Example code:
import pyspark
import pyspark.sql.functions as F

spark = (pyspark.sql.SparkSession.builder
         .appName("Downloader_standalone")
         .master('local[*]')
         .getOrCreate())
sc = spark.sparkContext

rdd_list = [('https://url.a', '[[1572393600000, 1.000],[1572480000000, 1.007]]'),
            ('https://url.b', '[[1572825600000, 1.002],[1572912000000, 1.000]]')]
jsons = sc.parallelize(rdd_list)
df = spark.createDataFrame(jsons, "url string, json string")
df.show(truncate=False)
df.printSchema()

# This last part fails: "array<string,string>" is not a valid schema string
(df.withColumn('json', F.from_json(F.col('json'), "array<string,string>"))
 .select(F.explode('json').alias('col_1', 'col_2', 'col_3'))
 .show())
There are a few similar examples, but I cannot figure out how to do it:
How to parse and transform json string from spark data frame rows in pyspark
How to transform JSON string with multiple keys, from spark data frame rows in pyspark?
Answer
With some string replacements and by splitting you can get the desired result:
from pyspark.sql import functions as F

df1 = df.withColumn(
    "col_1",
    F.regexp_replace("url", "https://url.", "")
).withColumn(
    "col_2_3",
    F.explode(
        F.expr("""transform(
                    split(trim(both '][' from json), '\],\['),
                    x -> struct(split(x, ',')[0] as col_2, split(x, ',')[1] as col_3)
                  )""")
    )
).selectExpr("col_1", "col_2_3.*")

df1.show(truncate=False)

#+-----+-------------+------+
#|col_1|col_2        |col_3 |
#+-----+-------------+------+
#|a    |1572393600000| 1.000|
#|a    |1572480000000| 1.007|
#|b    |1572825600000| 1.002|
#|b    |1572912000000| 1.000|
#+-----+-------------+------+
Explanation:

`trim(both '][' from json)`: removes the leading and trailing characters `[` and `]`, giving something like `1572393600000, 1.000],[1572480000000, 1.007`.

Now you can split by `],[` (the `\` is for escaping the brackets).

`transform` takes the array from the split and, for each element, splits it by comma and creates a struct with `col_2` and `col_3`.

Finally, `explode` the array of structs you get from the transform and star-expand the struct column.