i have a dataframe with column business_id and attributes with thousands of rows like this:
+------------------------------------------------------------------------------------+-------------------+ | attributes | business_id| +--------------------+---------------------------------------------------------------+-------------------+ |{"WiFi":"u'free","HasTV":"False","RestaurantsTableService":"True","Caters":"True".. |6iYb2HFDywm3zjuRg0q| |{"HasTV:"False","Ambience":{'romantic': False, 'intimate': False,},"Price":"2" .... |7f4z43MHAV-l-LsRYsa| +------------------------------------------------------------------------------------+-------------------+
how do create new column for each attribute with the value to the business id ? and if it’s not applicable to that business id, it will specify false.
example:
+---------------+-------------+-------------------+ | Wifi| Price| business_id| +---------------+-------------+-------------------+ | u'free| False|6iYb2HFDywm3zjuRg0q| | False| 2|7f4z43MHAV-l-LsRYsa| +---------------+-------------+-------------------+
while also noting that there are some attributes with value as object in an object like:
{..."Ambience":{'romantic': False, 'intimate': False}...}
for this, if possible, i would like to tag that values to the attribute like as the column name like:
Ambience.romantic
Would someone please help me with this if you know how?
Advertisement
Answer
You could apply pd.json_normalize()
to each of the dict
s in attributes
.
Update: if the elements in attributes
are JSON strings, then we need to turn them into dict first. The following function will do that (and also accept dict
):
import json def json_normalize(s): if isinstance(s, str): s = json.loads(s) return pd.json_normalize(s)
Now, use this instead of pd.json_normalize()
directly:
df2 = pd.concat(df['attributes'].apply(json_normalize).to_list(), keys=df.business_id) >>> df2 WiFi HasTV RestaurantsTableService Caters Price business_id 6iYb2HFDywm3zjuRg0q 0 u'free False True True NaN 7f4z43MHAV-l-LsRYsa 0 NaN False NaN NaN 2 Ambience.romantic Ambience.intimate business_id 6iYb2HFDywm3zjuRg0q 0 NaN NaN 7f4z43MHAV-l-LsRYsa 0 False False
Note, the above follows a reproducible setup:
# first, reproducible setup, with dicts df = pd.DataFrame({ 'attributes': [ {"WiFi":"u'free","HasTV":"False","RestaurantsTableService":"True","Caters":"True"}, {"HasTV":"False", "Ambience":{'romantic': False, 'intimate': False,},"Price":"2"}, ], 'business_id': ['6iYb2HFDywm3zjuRg0q', '7f4z43MHAV-l-LsRYsa'], }) # or, with strings df = pd.DataFrame({ 'attributes': [ """{"WiFi":"free","HasTV":"False","RestaurantsTableService":"True","Price":"3"}""", """{"HasTV":"False","Ambience":{"romantic": "False", "intimate": "False"},"Price":"2"}""", ], 'business_id': ['6iYb2HFDywm3zjuRg0q', '7f4z43MHAV-l-LsRYsa'], })