Skip to content
Advertisement

Splitting object data into new columns in dataframe

i have a dataframe with column business_id and attributes with thousands of rows like this:

+------------------------------------------------------------------------------------+-------------------+
|                                                                         attributes |        business_id|
+--------------------+---------------------------------------------------------------+-------------------+
|{"WiFi":"u'free","HasTV":"False","RestaurantsTableService":"True","Caters":"True".. |6iYb2HFDywm3zjuRg0q|
|{"HasTV:"False","Ambience":{'romantic': False, 'intimate': False,},"Price":"2" .... |7f4z43MHAV-l-LsRYsa|
+------------------------------------------------------------------------------------+-------------------+

how do create new column for each attribute with the value to the business id ? and if it’s not applicable to that business id, it will specify false.

example:

+---------------+-------------+-------------------+
|           Wifi|        Price|        business_id|
+---------------+-------------+-------------------+
|         u'free|        False|6iYb2HFDywm3zjuRg0q|
|          False|            2|7f4z43MHAV-l-LsRYsa|
+---------------+-------------+-------------------+

while also noting that there are some attributes with value as object in an object like:

{..."Ambience":{'romantic': False, 'intimate': False}...}

for this, if possible, i would like to tag that values to the attribute like as the column name like:

Ambience.romantic

Would someone please help me with this if you know how?

Advertisement

Answer

You could apply pd.json_normalize() to each of the dicts in attributes.

Update: if the elements in attributes are JSON strings, then we need to turn them into dict first. The following function will do that (and also accept dict):

import json

def json_normalize(s):
    if isinstance(s, str):
        s = json.loads(s)
    return pd.json_normalize(s)

Now, use this instead of pd.json_normalize() directly:

df2 = pd.concat(df['attributes'].apply(json_normalize).to_list(),
                keys=df.business_id)
>>> df2
                         WiFi  HasTV RestaurantsTableService Caters Price  
business_id                                                                 
6iYb2HFDywm3zjuRg0q 0  u'free  False                    True   True   NaN   
7f4z43MHAV-l-LsRYsa 0     NaN  False                     NaN    NaN     2   

                      Ambience.romantic Ambience.intimate  
business_id                                                
6iYb2HFDywm3zjuRg0q 0               NaN               NaN  
7f4z43MHAV-l-LsRYsa 0             False             False

Note, the above follows a reproducible setup:

# first, reproducible setup, with dicts
df = pd.DataFrame({
    'attributes': [
        {"WiFi":"u'free","HasTV":"False","RestaurantsTableService":"True","Caters":"True"},
        {"HasTV":"False", "Ambience":{'romantic': False, 'intimate': False,},"Price":"2"},
    ],
    'business_id': ['6iYb2HFDywm3zjuRg0q', '7f4z43MHAV-l-LsRYsa'],
})

# or, with strings
df = pd.DataFrame({
    'attributes': [
        """{"WiFi":"free","HasTV":"False","RestaurantsTableService":"True","Price":"3"}""",
        """{"HasTV":"False","Ambience":{"romantic": "False", "intimate": "False"},"Price":"2"}""",
    ],
    'business_id': ['6iYb2HFDywm3zjuRg0q', '7f4z43MHAV-l-LsRYsa'],
})
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement