I have a nested JSON dict that I need to convert to spark dataframe. This JSON dict is present in a dataframe column. I have been trying to parse the dict present in dataframe column using “from_json” and “get_json_object”, but have been unable to read the data. Here’s the smallest snippet of the source data that I’ve been trying to read:
{"value": "u0000u0000u0000u0000/{"context":"data"}"}
I need to extract the nested dict value. I used below code to clean the data and read it into a dataframe
from pyspark.sql.functions import * from pyspark.sql.types import * input_path = '/FileStore/tables/enrl/context2.json #path for the above file schema1 = StructType([StructField("context",StringType(),True)]) #Schema I'm providing raw_df = spark.read.json(input_path) cleansed_df = raw_df.withColumn("cleansed_value",regexp_replace(raw_df.value,'/','')).select('cleansed_value') #Removed extra '/' in the data cleansed_df.select(from_json('cleansed_value',schema=schema1)).show(1, truncate=False)
I get a null dataframe each time I run the above code. Please help.
Tried below stuff and it didn’t work: PySpark: Read nested JSON from a String Type Column and create columns
Also tried to write it to a JSON file and read it. It didn’t work as well: reading a nested JSON file in pyspark
Advertisement
Answer
The null chars u0000
affect the parsing of the JSON. You can replace them as well:
df = spark.read.json('path') df2 = df.withColumn( 'cleansed_value', F.regexp_replace('value','[u0000/]','') ).withColumn( 'parsed', F.from_json('cleansed_value','context string') ) df2.show(20,0) +-----------------------+------------------+------+ |value |cleansed_value |parsed| +-----------------------+------------------+------+ |/{"context":"data"}|{"context":"data"}|[data]| +-----------------------+------------------+------+