Skip to content
Advertisement

How to access data and handle missing data in a dictionaries within a dataframe

Given, df:

import pandas as pd
import numpy as np

data =
{'Col1': [1, 2, 3],
 'Person': [{'ID': 10001,
   'Data': {'Address': {'Street': '1234 Street A',
     'City': 'Houston',
     'State': 'Texas',
     'Zip': '77002'}},
   'Age': 30,
   'Income': 50000},
  {'ID': 10002,
   'Data': {'Address': {'Street': '7892 Street A',
     'City': 'Greenville',
     'State': 'Maine',
     'Zip': np.nan}},
   'Age': np.nan,
   'Income': 63000},
  {'ID': 10003, 'Data': {'Address': np.nan}, 'Age': 56, 'Income': 85000}]}

df = pd.DataFrame(data)

Input Dataframe:

   Col1                                                                                                                                            Person
0     1  {'ID': 10001, 'Data': {'Address': {'Street': '1234 Street A', 'City': 'Houston', 'State': 'Texas', 'Zip': '77002'}}, 'Age': 30, 'Income': 50000}
1     2  {'ID': 10002, 'Data': {'Address': {'Street': '7892 Street A', 'City': 'Greenville', 'State': 'Maine', 'Zip': nan}}, 'Age': nan, 'Income': 63000}
2     3                                                                               {'ID': 10003, 'Data': {'Address': nan}, 'Age': 56, 'Income': 85000}

My expected output dataframe is df[['Col1', 'Income', 'Age', 'Street', 'Zip']] where Income, Age, Street, and Zip come from within Person:

   Col1  Income   Age         Street    Zip
0     1   50000  30.0  1234 Street A  77002
1     2   63000   NaN  7892 Street A    nan
2     3   85000  56.0            NaN    nan

Advertisement

Answer

Using list comprehension, we can create most of these columns.

df['Income'] = [x.get('Income') for x in df['Person']]
df['Age'] = [x.get('Age') for x in df['Person']]
df['Age']

Output:

0    30.0
1     NaN
2    56.0
Name: Age, dtype: float64

However, dealing with np.nan values inside a nested dictionary is a real pain. Let’s look at getting data from a nested dictionary data where one of the values is nan.

df['Street'] = [x.get('Data').get('Address').get('Street') for x in df['Person']]

We get an AttributeError:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-80-cc2f92bfe95d> in <module>
      1 #However, let's look at getting data rom a nested dictionary where one of the values is nan.
      2 
----> 3 df['Street'] = [x.get('Data').get('Address').get('Street') for x in df['Person']]
      4 
      5 #We get and AttributeError because NoneType object has no get method

<ipython-input-80-cc2f92bfe95d> in <listcomp>(.0)
      1 #However, let's look at getting data rom a nested dictionary where one of the values is nan.
      2 
----> 3 df['Street'] = [x.get('Data').get('Address').get('Street') for x in df['Person']]
      4 
      5 #We get and AttributeError because NoneType object has no get method

AttributeError: 'float' object has no attribute 'get'

Let’s use the .str accessor with dictionary keys to fetch this data.
There is little documentation in pandas that shows how you can use .str.get or .str[] to fetch values from dictionary objects in a dataframe column/pandas series.

df['Street'] = df['Person'].str['Data'].str['Address'].str['Street']

Output:

0    1234 Street A
1    7892 Street A
2              NaN
Name: Street, dtype: object

And, likewise with

df['Zip'] = df['Person'].str['Data'].str['Address'].str['Zip']

Leaving us with the columns to build the desired dataframe df[['Col1', 'Income', 'Age', 'Street', 'Zip']] from dictionaries.

Output:

   Col1  Income   Age         Street    Zip
0     1   50000  30.0  1234 Street A  77002
1     2   63000   NaN  7892 Street A    NaN
2     3   85000  56.0            NaN    NaN
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement