I have this sample dataset:
the_df = pd.DataFrame( {'id':['AM','AN','AP'], 'target':[130,60,180], 'moves':[[{'date':'2022-08-01','amount':285.0,'name':'Cookie'}, {'name':'Rush','amount':10,'date':'2022-08-02','type':'song'}], [{'amount':250.5,'date':'2022-08-01','source':{'data':'bing'}}],[]]}) the_df id target moves 0 AM 130 [{'date': '2022-08-01', 'amount': 285.0, 'name... 1 AN 60 [{'amount': 250.5, 'date': '2022-08-01', 'sour... 2 AP 180 []
And I want to ‘expand’ (or ‘explode’) each value in the json column, but only selecting some columns. This is the expected result:
id target date amount name 0 AM 130 2022-08-01 285.0 Cookie 1 AM 130 2022-08-02 10.0 Rush 2 AN 60 2022-08-01 250.5 3 AP 180
Firstly I tried using json_normalize
and iterate over each row (even when the last row has no data), but I have to know before how many rows I’m going to expand:
pd.json_normalize(the_df.moves[0]) date amount name type 0 2022-08-01 285.0 Cookie NaN 1 2022-08-02 10.0 Rush song pd.json_normalize(the_df.moves[1]) amount date source.data 0 250.5 2022-08-01 bing pd.json_normalize(the_df.moves[2])
I only want keys date
,amount
and name
. So I tried this:
temp_df = pd.DataFrame(columns=['date','amount','name']) for i in range(len(the_df)): temp_df = temp_df.append(pd.json_normalize(the_df.moves[i])) temp_df date amount name type source.data 0 2022-08-01 285.0 Cookie NaN NaN 1 2022-08-02 10.0 Rush song NaN 0 2022-08-01 250.5 NaN NaN bing
But my data frame temp_df
needs a reference to the original dataset the_df
to apply a merge. Please, could you guide me to the right solution? I guess there must be a way to recall the id, or a method in pandas to do this without a for loop.
Advertisement
Answer
Here are the steps you could follow
(1) define df
df = pd.DataFrame( {'id':['AM','AN','AP'], 'target':[130,60,180], 'moves':[[{'date':'2022-08-01','amount':285.0,'name':'Cookie'}, {'name':'Rush','amount':10,'date':'2022-08-02','type':'song'}], [{'amount':250.5,'date':'2022-08-01','source':{'data':'bing'}}],[]]}) print(df) id target moves 0 AM 130 [{'date': '2022-08-01', 'amount': 285.0, 'name': 'Cookie'}, {'name': 'Rush', 'amount': 10, 'date': '2022-08-02', 'type': 'song'}] 1 AN 60 [{'amount': 250.5, 'date': '2022-08-01', 'source': {'data': 'bing'}}] 2 AP 180 []
(2) explode the column ‘moves’
df1 = df.explode('moves', ignore_index=True) print(df1) id target moves 0 AM 130 {'date': '2022-08-01', 'amount': 285.0, 'name': 'Cookie'} 1 AM 130 {'name': 'Rush', 'amount': 10, 'date': '2022-08-02', 'type': 'song'} 2 AN 60 {'amount': 250.5, 'date': '2022-08-01', 'source': {'data': 'bing'}} 3 AP 180 NaN
(3) json_normalize the column ‘moves’
df2 = pd.json_normalize(df1['moves']) print(df2) date amount name type source.data 0 2022-08-01 285.0 Cookie NaN NaN 1 2022-08-02 10.0 Rush song NaN 2 2022-08-01 250.5 NaN NaN bing 3 NaN NaN NaN NaN NaN
(4) concat the 2 df
with only the relevant columns
df3 = pd.concat([df1[['id', 'target']], df2[['date', 'amount', 'name']]], axis=1) print(df3) id target date amount name 0 AM 130 2022-08-01 285.0 Cookie 1 AM 130 2022-08-02 10.0 Rush 2 AN 60 2022-08-01 250.5 NaN 3 AP 180 NaN NaN NaN