I wonder how you would iteratively get the values from a json string in pyspark. I have the following format of my data and would like to create the “value” column:
id_1 | id_2 | json_string | value |
---|---|---|---|
1 | 1001 | {“1001”:106, “2200”:101} | 106 |
1 | 2200 | {“1001”:106, “2200”:101} | 101 |
df_2 = df.withColumn ( 'value', get_json_object(col('json_string'), concat(lit('$.'), col('id_2'))) )
Which gives the error Column is not iterable
However, just inserting the key manually works, i.e:
df_2 = df.withColumn ( 'value', get_json_object(col('json_string'), '$.1001'))) )
Any tips on solving this problem? It is not possible to manually insert the “id_2” values since there are many thousands of keys within the dataset and the json_string is in reality much longer with many more key-value pairs.
Super thankful for any suggestions!
Regards
Advertisement
Answer
You can use it within expr()
which would allow you to concat the string and id_2
.
data_ls = [ ("1", "1001", '''{"1001":106, "2200":101}'''), ("1", "2200", '''{"1001":106, "2200":101}''') ] data_sdf = spark.createDataFrame(data_ls, ("id1", "id2", "jstr")) # +---+----+--------------------+ # |id1| id2| jstr| # +---+----+--------------------+ # | 1|1001|{"1001":106, "220...| # | 1|2200|{"1001":106, "220...| # +---+----+--------------------+ data_sdf. withColumn('val', func.expr('get_json_object(jstr, concat("$.", id2))')). show(truncate=False) # +---+----+------------------------+---+ # |id1|id2 |jstr |val| # +---+----+------------------------+---+ # |1 |1001|{"1001":106, "2200":101}|106| # |1 |2200|{"1001":106, "2200":101}|101| # +---+----+------------------------+---+