Complex nested dict to pandas with multilevel index

I have a complex nested dictionary with multiple levels. I wish to create a multilevel index pandas dataframe from the dict.

The dict takes the format:

{
    "bucket_0" : [
        {
          "key" : 'foo',
          "level_1" : {
            "bucket_1" : [
              {
                "key" : 'bar',
                "level_2" : {
                  "bucket_2" : [
                    {
                      "key" : 'baz',
                      "max" : {
                        "value" : 100
                      }
                    },
                    .
                    .
                    .
                  ]
                }
              },
              .
              .
              .
            ]
          }
        },
        .
        .
        .
    ]
}

Basically, there are 3 nested arrays, each containing dicts. At the outermost level, bucket_0, we have X foo dicts. At the next level, bucket_1, we have Y bar dicts. At the last level, bucket_2, we have Z baz dicts.

I know I can iterate through it using 3 nested for loops, but I would prefer to avoid that. I'm aware of pandas' json_normalize() function but am unsure how to use it effectively, especially when trying to create a multilevel index.

My desired result would be a multilevel index dataframe where a row would look like: ['foo', 'bar', 'baz', 'max.value']

Also tagged as Elasticsearch since this is a result from a nested aggregation query.

Answer

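You can expand the structure one level at a time, appending each level's key to the index as you go:

import pandas as pd

js = {'bucket_0': [{'key': 'foo',
   'level_1': {'bucket_1': [{'key': 'bar',
      'level_2': {'bucket_2': [{'key': 'baz', 'max': {'value': 100}}]}}]}}]}

# Flatten the outer level: one row per bucket_1 entry, indexed by the outer key
df = pd.json_normalize(js["bucket_0"]).explode("level_1.bucket_1").set_index("key")
# Expand each bucket_1 dict into columns; append its key as the second index level
df = df["level_1.bucket_1"].apply(pd.Series).set_index("key", append=True)
# Expand the level_2 dicts, then produce one row per bucket_2 entry
df = df["level_2"].apply(pd.Series).explode("bucket_2")
# Expand each bucket_2 dict into columns; append its key as the third index level
df = df["bucket_2"].apply(pd.Series).set_index("key", append=True)
# Expand the "max" dicts into a "value" column and name the index levels
df = df["max"].apply(pd.Series).rename_axis(index=["l1","l2","l3"])
print(df)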

Output:

             value
l1  l2  l3
foo bar baz    100
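
For comparison, here is a minimal sketch of the plain-Python route the question mentions, written as a single nested comprehension rather than three explicit loops. It assumes every entry has the same shape as the sample (a "key" at each level and a "max" dict holding "value"). The second bucket_2 entry ('qux', 50) is hypothetical data added only to show more than one row, and the names rows and flat are illustrative.

```python
import pandas as pd

# Sample shaped like the question's dict; the 'qux' entry is hypothetical.
js = {'bucket_0': [{'key': 'foo',
   'level_1': {'bucket_1': [{'key': 'bar',
      'level_2': {'bucket_2': [{'key': 'baz', 'max': {'value': 100}},
                               {'key': 'qux', 'max': {'value': 50}}]}}]}}]}

# Walk the three bucket lists and emit one tuple per innermost entry.
rows = [
    (b0["key"], b1["key"], b2["key"], b2["max"]["value"])
    for b0 in js["bucket_0"]
    for b1 in b0["level_1"]["bucket_1"]
    for b2 in b1["level_2"]["bucket_2"]
]

# Build the frame and move the three key columns into a MultiIndex.
flat = pd.DataFrame(rows, columns=["l1", "l2", "l3", "value"])
flat = flat.set_index(["l1", "l2", "l3"])
print(flat)
```

This touches each innermost entry once and avoids the repeated apply(pd.Series) calls, which may matter for large aggregation results. It will raise a KeyError if any entry lacks one of the expected keys.
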
User contributions licensed under: CC BY-SA