I wonder how you would iteratively get the values from a JSON string in PySpark. My data has the following format, and I would like to create the "value" column:
| id_1 | id_2 | json_string              | value |
|------|------|--------------------------|-------|
| 1    | 1001 | {"1001":106, "2200":101} | 106   |
| 1    | 2200 | {"1001":106, "2200":101} | 101   |
```python
df_2 = df.withColumn(
    'value', get_json_object(col('json_string'), concat(lit('$.'), col('id_2')))
)
```
This gives the error `Column is not iterable`: the `path` argument of `get_json_object()` in the Python API only accepts a plain string, not a `Column`, so the path cannot vary per row.
However, inserting the key manually works, i.e.:
```python
df_2 = df.withColumn(
    'value', get_json_object(col('json_string'), '$.1001')
)
```
Any tips on solving this problem? It is not possible to insert the `id_2` values manually, since there are many thousands of keys within the dataset and the `json_string` is in reality much longer, with many more key-value pairs.
Super thankful for any suggestions!
Regards
Answer
You can call `get_json_object()` within `expr()`, which allows you to concatenate the `'$.'` prefix with the `id_2` column inside the SQL expression.
```python
from pyspark.sql import functions as func

data_ls = [
    ("1", "1001", '''{"1001":106, "2200":101}'''),
    ("1", "2200", '''{"1001":106, "2200":101}''')
]

data_sdf = spark.createDataFrame(data_ls, ("id1", "id2", "jstr"))

# +---+----+--------------------+
# |id1| id2|                jstr|
# +---+----+--------------------+
# |  1|1001|{"1001":106, "220...|
# |  1|2200|{"1001":106, "220...|
# +---+----+--------------------+

# build the JSON path per row by concatenating "$." with the id2 value;
# inside expr() the path can be a column expression
data_sdf \
    .withColumn('val', func.expr('get_json_object(jstr, concat("$.", id2))')) \
    .show(truncate=False)

# +---+----+------------------------+---+
# |id1|id2 |jstr                    |val|
# +---+----+------------------------+---+
# |1  |1001|{"1001":106, "2200":101}|106|
# |1  |2200|{"1001":106, "2200":101}|101|
# +---+----+------------------------+---+
```
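If you would rather stay in the column API instead of building a SQL expression string, an alternative is to parse the JSON into a map with `from_json` and look the key up with `element_at`. A minimal sketch, assuming the JSON values are all integers (as in the sample) and a Spark version where `element_at` accepts a column key (2.4+):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, IntegerType

# parse the JSON string into a map<string, int>, then look up the id2 key in it;
# note 'val' comes back as an int here, while get_json_object returns a string
df_2 = (
    data_sdf
    .withColumn('jmap', F.from_json('jstr', MapType(StringType(), IntegerType())))
    .withColumn('val', F.element_at('jmap', F.col('id2')))
    .drop('jmap')
)
```

The trade-off is that `from_json` needs the value type declared up front, whereas the `expr()` approach works regardless of what the values are.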