How to parse and transform json string from spark dataframe rows in pyspark?
I’m looking for help how to parse:
- json string to json struct
output 1
- transform json string to columns a, b and id
output 2
Background: I get via API json strings with a large number of rows (jstr1
, jstr2
, …), which are saved to spark df
. I can read schema for each row separately, but this is not the solution as it is very slow as schema has a large number of rows. Each jstr
has the same schema, columns/keys a and b stays the same, just id
and values in columns change.
EDIT: blackbishop solution to use MapType schema works like a charm schema = "map<string, array<struct<a:int,b:int>>>"
Question was extended to: How to transform JSON string with multiple keys, from spark data frame rows in pyspark?
from pyspark.sql import Row jstr1 = '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]}' jstr2 = '{"id_2": [{"a": 5, "b": 6}, {"a": 7, "b": 8}]}' df = sqlContext.createDataFrame([Row(json=jstr1),Row(json=jstr2)]) schema = F.schema_of_json(df.select(F.col("json")).take(1)[0].json) df2 = df.withColumn('json', F.from_json(F.col('json'), schema)) df2.show()
Current output:
+--------------------+ | json| +--------------------+ |[[[1, 2], [3, 4]]] | | []| +--------------------+
Required output 1:
+--------------------+-------+ | json | id | +--------------------+-------+ |[[[1, 2], [3, 4]]] | id_1 | |[[[5, 6], [7, 8]]] | id_2 | +--------------------+-------+
Required output 2:
+---------+----------+-------+ | a | b | id | +--------------------+-------+ | 1 | 2 | id_1 | | 3 | 4 | id_1 | | 5 | 6 | id_2 | | 7 | 8 | id_2 | +---------+----------+-------+
Advertisement
Answer
You’re getting null for the second row because you’re using only the schema of the first row which is different from the second one. You can parse the JSON to a MapType instead, where the keys are of type string and values of type array of structs :
schema = "map<string, array<struct<a:int,b:int>>>" df = df.withColumn('json', F.from_json(F.col('json'), schema)) df.printSchema() #root # |-- json: map (nullable = true) # | |-- key: string # | |-- value: array (valueContainsNull = true) # | | |-- element: struct (containsNull = true) # | | | |-- a: integer (nullable = true) # | | | |-- b: integer (nullable = true)
Then, with some simple transformations, you get the expected outputs:
- The
id
column represents the key in the map, you get it withmap_keys
function - The structs
<a:int, b:int>
represents the values that you get usingmap_values
function
output1 = df.withColumn("id", F.map_keys("json").getItem(0)) .withColumn("json", F.map_values("json").getItem(0)) output1.show(truncate=False) # +----------------+----+ # |json |id | # +----------------+----+ # |[[1, 2], [3, 4]]|id_1| # |[[5, 6], [7, 8]]|id_2| # +----------------+----+ output2 = output1.withColumn("attr", F.explode("json")) .select("id", "attr.*") output2.show(truncate=False) # +----+---+---+ # |id |a |b | # +----+---+---+ # |id_1|1 |2 | # |id_1|3 |4 | # |id_2|5 |6 | # |id_2|7 |8 | # +----+---+---+