Adding to a nested Dictionary in a Column in Pandas

Tags: ,



So, I have a fun issue. I have some data that have a fun nested dictionary that I need to manipulate, but am having trouble. I can do it in pure python, but wanted to do the entire solution in Pandas so as to keep the code a little cleaner and not have to re-open the same files elsewhere.

Dataframe:

Id             Timezone             Data
957643         Pacific             {"California":{"city":"San Francisco","pop":"874961"}}
973472         Eastern             {"New York":{"city":"New York","pop":"8419000"}}

Desired output, a list of dictionaries, placing the Id and Timezone into the nested dictionary while wrapping each in another key so I can just output as JSON:

[{"State Date":{"California":{"City":"San Francisco","Population":"874961","Id":"957643","Timezone":"Pacific"}}}, {"State Date":"New York":{"City":"New York","Population":"8419000", "Id":"973472","Timezone":"Eastern"}}}]

The problem is that I need to, for the final data format to be ingested elsewhere, place Timezone and Id into the nested portion of each row to be exported into JSON and rename some fields. I’ve tried the iterrows method and apply with an axis of 1 but it ends up putting all Id’s and timezones in each dictionary, though it does nest them.

A variation of below works in pure python when reading in the whole CSV, but not in Pandas (for reasons that are likely obvious for most). For brevity, these are the rows I’m interested in, so I drop any others.

output = []
entry = {}

for id_, time, data in the_states.iterrows():
        for state, other in data.items():
            entry['Id'] = id_
            entry['City'] = data.get('city')
            entry['Timezone'] = time
            entry['Population'] = data.get('pop')
            output.append({'State Data': entry})

Any help would be appreciated.

Answer

One way:

  1. Create another list of dict via to_dict('records').
  2. zip and iterate over both the list of dict.
  3. Update the 1st one with the other to get the desired JSON.
result = []
for i,j in zip(df.Data,df[['Id', 'Timezone']].to_dict('records')):
    for key in i.keys():
        i[key] = {**i[key], **j}
    result.append({'state_data': i})

OUTPUT:

[{'state_data': {'California': {'city': 'San Francisco',
    'pop': '874961',
    'Id': 957643,
    'Timezone': 'Pacific'}}},
 {'state_data': {'New York': {'city': 'New York',
    'pop': '8419000',
    'Id': 973472,
    'Timezone': 'Eastern'}}}]
Complete Example:
df = pd.DataFrame({'Id': {0: 957643, 1: 973472},
 'Timezone': {0: 'Pacific', 1: 'Eastern'},
 'Data': {0: {"California":{"city":"San Francisco","pop":"874961"}},
  1: {"New York":{"city":"New York","pop":"8419000"}}}})

result = []
for i,j in zip(df.Data,df[['Id', 'Timezone']].to_dict('records')):
    for key in i.keys():
        i[key] = {**i[key], **j}
    result.append({'state_data': i})


Source: stackoverflow