df = pd.DataFrame.from_dict(dict_name, orient='index')
df.fillna('NaN', inplace=True)
df.to_csv('taxonomy_3.csv', index=True, header=True)
The code above converts a nested dictionary to a DataFrame without problems, but if the nested dictionary was built with the .append() or .extend() method, the values carry extraneous brackets [] and quotes '', which makes downstream analysis difficult.
For example, take a nested dictionary like this:
{'Ceratopteris richardii': {'superkingdom': ['Eukaryota'], 'kingdom': ['Viridiplantae'], 'phylum': ['Streptophyta'], 'subphylum': ['Streptophytina'], 'clade': ['Embryophyta', 'Tracheophyta', 'Euphyllophyta'], 'class': ['Polypodiopsida'], 'subclass': ['Polypodiidae'], 'order': ['Polypodiales'], 'suborder': ['Pteridineae'], 'family': ['Pteridaceae'], 'subfamily': ['Parkerioideae'], 'genus': ['Ceratopteris']}, 'Arabidopsis thaliana': {'superkingdom': ['Eukaryota'], 'kingdom': ['Viridiplantae'], 'phylum': ['Streptophyta'], 'subphylum': ['Streptophytina'], 'clade': ['Embryophyta', 'Tracheophyta', 'Euphyllophyta', 'Spermatophyta', 'Mesangiospermae', 'eudicotyledons', 'Gunneridae', 'Pentapetalae', 'rosids', 'malvids'], 'class': ['Magnoliopsida'], 'order': ['Brassicales'], 'family': ['Brassicaceae'], 'tribe': ['Camelineae'], 'genus': ['Arabidopsis']}}
created with the setup:
line = line.strip()  # remove newline character
words = line.split("\t", 1)  # split the line at the first tab
if words[0] in taxonomy[name]:  # add value if key already exists
    taxonomy[name][words[0]].append(words[1])
else:  # add key and value if key does not exist
    taxonomy[name][words[0]] = [words[1]]
When this is converted to a DataFrame with pd.DataFrame.from_dict(), it creates a table that looks like this:
Column one | Column two |
---|---|
Key1 | ['Value1', 'Value2', 'value3'] |
Key2 | ['Value2', 'value4', 'value5'] |
Here each cell becomes a single lump of text, brackets and quotes included, and a level of the data is lost.
Something like this would be better, since it preserves that level of data:
Column one | Column two |
---|---|
Key1 | Value1,Value2,value3 |
Key2 | Value2,value4,value5 |
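The difference between the two tables can be reproduced with a small sketch. The `raw` dictionary and its keys are hypothetical toy data that mirror the Key1/Key2 tables above, not the real taxonomy. It assumes pandas writes a list cell to CSV using the list's string representation, which is what produces the brackets and quotes.

```python
import pandas as pd

# Hypothetical toy data mirroring the Key1/Key2 tables above.
raw = {
    "Key1": {"Column two": ["Value1", "Value2", "value3"]},
    "Key2": {"Column two": ["Value2", "value4", "value5"]},
}

df = pd.DataFrame.from_dict(raw, orient="index")

# Each cell still holds a Python list, so the CSV text contains
# the list's string form, brackets and quotes included.
with_lists = df.to_csv()

# Joining each list into one string first gives the second table.
flat = df["Column two"].map(",".join).to_frame()
joined = flat.to_csv()

print(with_lists)
print(joined)
```

In the second CSV the joined values are still wrapped in double quotes by the CSV writer, because the cell text itself contains commas.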
It seems the extraneous characters are essential delimiters that are needed when updating keys, so as best I can tell that rules out extending the values without brackets or quotes.
What would be more appropriate:
- Try to convert to dataframe from dictionary and remove extraneous characters in conversion? If so, how?
- Remove brackets and quotes with regex once the dataframe is created?
Answer
One option is to stack the columns, join the strings, then unstack:
out = pd.DataFrame(my_data).stack().map(', '.join).unstack()
But it’s probably more efficient to modify the input dictionary in vanilla Python first and then construct the DataFrame:
for d in my_data.values():
    for k, v in d.items():
        d[k] = ', '.join(v)
out = pd.DataFrame(my_data)
Output:
              Ceratopteris richardii                    Arabidopsis thaliana
superkingdom  Eukaryota                                 Eukaryota
kingdom       Viridiplantae                             Viridiplantae
phylum        Streptophyta                              Streptophyta
subphylum     Streptophytina                            Streptophytina
clade         Embryophyta, Tracheophyta, Euphyllophyta  Embryophyta, Tracheophyta, Euphyllophyta, Sper...
class         Polypodiopsida                            Magnoliopsida
subclass      Polypodiidae                              NaN
order         Polypodiales                              Brassicales
suborder      Pteridineae                               NaN
family        Pteridaceae                               Brassicaceae
subfamily     Parkerioideae                             NaN
genus         Ceratopteris                              Arabidopsis
tribe         NaN                                       Camelineae
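If the original lists are still needed afterwards, a variant that builds a new dictionary instead of overwriting `my_data` might look like the sketch below. The shortened `my_data` and the names `joined`, `df` and `csv_text` are made up for illustration. It uses `orient="index"` to match the question's layout (one species per row) and `na_rep` as an assumed stand-in for the question's `fillna('NaN')` step.

```python
import pandas as pd

# Hypothetical, shortened stand-in for the nested dictionary in the question.
my_data = {
    "Ceratopteris richardii": {
        "kingdom": ["Viridiplantae"],
        "clade": ["Embryophyta", "Tracheophyta"],
        "subclass": ["Polypodiidae"],
    },
    "Arabidopsis thaliana": {
        "kingdom": ["Viridiplantae"],
        "clade": ["Embryophyta", "Spermatophyta"],
        "tribe": ["Camelineae"],
    },
}

# Build a new dictionary of joined strings; my_data itself is left untouched.
joined = {
    species: {rank: ", ".join(names) for rank, names in ranks.items()}
    for species, ranks in my_data.items()
}

# orient="index" puts one species per row, as in the question's from_dict call.
df = pd.DataFrame.from_dict(joined, orient="index")

# With no path argument, to_csv returns the CSV text as a string.
# na_rep writes missing ranks as the text "NaN" without changing df.
csv_text = df.to_csv(na_rep="NaN")
print(csv_text)
```

Cells that contain a comma are still wrapped in double quotes in the CSV text, which is how the CSV format protects embedded delimiters. Joining with a different separator such as "; " would avoid that, if downstream tools allow it.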