Skip to content
Advertisement

Parsing nested JSON with list comprehension in Python

My data is as following (this just extract but there are much more objects, some don’t have the additionalData)

{
   "referenceSetCount":1,
   "totalRowCount":4,
   "referenceSets":[
      {
         "name":"table",
         "rowCount":4,
         "_links":{
            "self":{
               "href":"link"
            }
         },
         "referenceDataItems":[
            {
               "col1":"5524",
               "col2":"yyy",
               "col3":1,
               "additionalData":[
                  {
                     "col1":111,
                     "col2":"xxxx",
                     "col3":1,
                     "col4":"18"
                  },
                  {
                     "col1":222,
                     "col2":"2222",
                     "col3":1,
                     "col4":"1"
                  }
               ]
            },
            {
               "col1":"26434",
               "col2":"dfdshere",
               "col3":2,
               "additionalData":[
                  {
                     "col1":34522,
                     "col2":"fsfs",
                     "col3":2,
                     "col4":"18"
                  },
                  {
                     "col1":5444,
                     "col2":"gregrege",
                     "col3":2,
                     "col4":"2"
                  }
               ]
            }
         ]
      }
   ]
}

I’m trying to iterate with list comprehension to get dataframe of referenceDataItems and everything within that key, also additionalData if appears.

import os
import sys
import pandas as pd
import urllib.request, json

api_url = urllib.request.urlopen("link_to_my_data")

api_data = json.loads(api_url.read())

#nest loop to get referenceSets + nested additionalData
data_alt = [v for k, v in api_data.items() if k == 'referenceSets']

Expected result:

col1    col2        col3    col1    col2    col3    col4    col1    col2        col3    col4
5524    yyy         1       111     xxxx    1       18      222     2222        1       1
26434   dfdshere    2       34522   fsfs    2       18      5444    gregrege    2       2

Advertisement

Answer

I did some research and this almost got my desired data, needs little modification in COLUMNS_TO_DROP

COLUMNS_TO_DROP = ["additionalData"]

def expand_additional_data(items):
    for item in items:
        for av in item.get("additionalData", []):
            item[av["col2a"]] = av["col4a"]
        yield item


for ref_set in data["referenceSets"]:
    table_name = ref_set["name"]
    expanded = expand_additional_data(ref_set["referenceDataItems"])
    df = pd.DataFrame(expanded)
    df = df.drop(COLUMNS_TO_DROP, axis=1, errors="ignore")
    print(df)
User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement