I am working with CSV files where several of the columns have a simple json object (several key value pairs) while other columns are normal. Here is an example:
name,dob,stats john smith,1/1/1980,"{""eye_color"": ""brown"", ""height"": 160, ""weight"": 76}" dave jones,2/2/1981,"{""eye_color"": ""blue"", ""height"": 170, ""weight"": 85}" bob roberts,3/3/1982,"{""eye_color"": ""green"", ""height"": 180, ""weight"": 94}"
After using df = pandas.read_csv('file.csv')
, what’s the most efficient way to parse and split the stats
column into additional columns?
After about an hour, the only thing I could come up with was:
import json stdf = df['stats'].apply(json.loads) stlst = list(stdf) stjson = json.dumps(stlst) df.join(pandas.read_json(stjson))
This seems like I’m doing it wrong, and it’s quite a bit of work considering I’ll need to do this on three columns regularly.
Desired output is the dataframe object below. Added following lines of code to get there in my (crappy) way:
df = df.join(pandas.read_json(stjson)) del(df['stats']) In [14]: df Out[14]: name dob eye_color height weight 0 john smith 1/1/1980 brown 160 76 1 dave jones 2/2/1981 blue 170 85 2 bob roberts 3/3/1982 green 180 94
Advertisement
Answer
There is a slightly easier way, but ultimately you’ll have to call json.loads There is a notion of a converter in pandas.read_csv
converters : dict. optional Dict of functions for converting values in certain columns. Keys can either be integers or column labels
So first define your custom parser. In this case the below should work:
def CustomParser(data): import json j1 = json.loads(data) return j1
In your case you’ll have something like:
df = pandas.read_csv(f1, converters={'stats':CustomParser},header=0)
We are telling read_csv to read the data in the standard way, but for the stats column use our custom parsers. This will make the stats column a dict
From here, we can use a little hack to directly append these columns in one step with the appropriate column names. This will only work for regular data (the json object needs to have 3 values or at least missing values need to be handled in our CustomParser)
df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)
On the Left Hand Side, we get the new column names from the keys of the element of the stats column. Each element in the stats column is a dictionary. So we are doing a bulk assign. On the Right Hand Side, we break up the ‘stats’ column using apply to make a data frame out of each key/value pair.