There dataframe below has columns with mixed types. Column of interest for expansion is “Info”. Each row value in this column is a JSON object.
data = {'Code':['001', '002', '003', '004'], 'Info':['{"id":001,"x_cord":[1,1,1,1],"x_y_cord":[4.703978,-39.601876],"neutral":1,"code_h":"S38A46","group":null}','{"id":002,"x_cord":[2,1,3,1],"x_y_cord":[1.703978,-38.601876],"neutral":2,"code_h":"S17A46","group":"New"}','{"id":003,"x_cord":[1,1,4,1],"x_y_cord":[112.703978,-9.601876],"neutral":4,"code_h":"S12A46","group":"Old"}','{"id":004,"x_cord":[2,1,7,1],"x_y_cord":[6.703978,-56.601876],"neutral":1,"code_h":"S12A46","group":null}'], 'Region':['US','Pacific','Africa','Asia']} df = pd.DataFrame(data)
I would like to have the headers expanded i.e. have “Info.id”,”info.x_y_cord”,”info.neutral” etc as individual columns with corresponding values under them across the dataset. I’ve tried normalizing them via pd.json_normalize(df[“Info”]) iteration but nothing seems to change. Do I need to convert the column to another type first? Can someone point me to the right direction?
The output should be something like this:
data1 = {'Code':['001', '002', '003', '004'], 'Info.id':['001','002','003','004'], 'Info.x_cord':['[1,1,1,1]','[2,1,3,1]','[1,1,4,1]','[2,1,7,1]'], 'Info.x_y_cord':['[4.703978,-39.601876]','[1.703978,-38.601876]','[112.703978,-9.601876]','[6.703978,-56.601876]'], 'Info.neutral':[1,2,4,1], 'Info.code_h':['S38A46','S17A46','S12A46','S12A46'], 'Info.group':[np.NaN,"New","Old",np.NaN], 'Region':['US','Pacific','Africa','Asia']} df_final = pd.DataFrame(data1)
Advertisement
Answer
First of all, your JSON strings seem to be not valid because of the ID value. 001
is not processed correctly so you’ll need to pass the “id” value as a string instead. Here’s one way to do that:
def id_as_string(matchObj): # Adds " around the ID value return f""id":"{matchObj.group(1)}"," df["Info"] = df["Info"].str.replace(""id":(d*),", repl=id_to_string, regex=True))
Once you’ve done that, you can use pd.json_normalize
on your “Info” column after you’ve loaded the values from the JSON strings using json.loads
:
import json json_part_df = pd.json_normalize(df["Info"].map(json.loads))
After that, just rename the columns and use pd.concat
to form the output dataframe:
# Rename columns json_part_df.columns = [f"Info.{column}" for column in json_part_df.columns] # Use pd.concat to create output df = pd.concat([df[["Code", "Region"]], json_part_df], axis=1)