How do I convert a flat table into a JSON?
I have previously converted JSONs into Flat Tables using both custom code and libraries. However, what I am aiming to do here is the reverse. Before going ahead and creating a custom library, I was wondering if anyone had encountered this problem before and if there was an existing solution to it.
When you flatten a JSON into a CSV, you loose the information on the structure, and therefore to reverse it, you need a document that describes how the JSON should be built, which ideally would be the standardised JSON Schema.
The following example shows a source CSV, the JSON Schema and the expected output.
User CSV
user_id, adress.city, address.street, address.number, name, aka, contacts.name, contacts.relationship 1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Morty, Grandson 1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Morty, Grandson 1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Morty, Grandson 1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Morty, Grandson 1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Beth, Daughter 1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Beth, Daughter 1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Beth, Daughter 1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Beth, Daughter
JSON Schema
This follows the defined standard with the addition of the “source” property. I am suggesting adding this custom property to this specific problem in order to map between the csv columns and the JSON values (leafs).
{ "$schema": "https://json-schema.org/draft/2020-12/schema", "title": "User", "type": "object", "properties":{ "user_id" : {"type":"integer", "source":"user_id"}, "address":{ "type":"object", "properties":{ "city" : {"type":"string", "source":"adress.city"}, "street" : {"type":"string", "source":"adress.street"}, "number": { "type":"integer", "source":"adress.number"} } }, "name" : {"type":"string", "source":"name"}}, "aka":{ "type": "array", "items" : {"type":"string", "source":"aka"} }, "contacts":{ "type":"array", "items":{ "type":"object", "properties":{ "name" : {"type":"string", "source":"contacts.name"}, "relationship":{"type":"string", "source":"contacts.relationship"} }, } } } }
Expected JSON
{ "user_id":1, "address":{ "city":"Seattle", "street":"Atomic Street", "number":6910 }, "name":"Rick Sanchez", "aka":[ "Rick", "Grandpa", "Albert Ein-douche", "Richard" ], "contacts":[ { "name":"Morty", "relationship":"Grandson" }, { "name":"Beth", "relationship":"Daughter" } ] }
From the above we see that although there are 8 rows in the CSV, we are producing a single JSON Object (instead of 8) since there is only one unique user (user_id = 1). This could be inferred from the JSON Schema where the root element is an object and not a list.
If we did not specify a JSON Schema or other kind of mapping, you could simply assume no structure and just create 8 flat jsons as below
[ {"user_id":1,"address.city":"Seattle", ... "aka":"Rick" ... "contacts.relationship":"Grandson"} ... {"user_id":1,"address.city":"Seattle", ... "aka":"Richard" ... "contacts.relationship":"Daughter"} ]
I am adding the Python tag since that is the language I use mostly, but in this case, the solution doesn’t need to be in Python.
Advertisement
Answer
I’m not entirely clear on why JSON schema would be needed for this, but if you wanted to, you could easily create a convenience function which can essentially “unflatten” the flat JSON that your CSV data would be mapped to, into a nested dictionary format as mentioned above.
The following example should demonstrate a simplified example of how this would work. Note the following two points:
In the CSV header, I’ve corrected a typo and renamed one of the columns to
address.city
; previously, it wasadress.city
, which would result in it getting mapped to another JSON path under a separateadress
key, which might not be desirable.I wasn’t sure of the best way to handle this, but it looks like
csv
module only allows a single-character delimiter; in the CSV file, it looks like you have a comma and a space,
as the separator, so I’ve just replaced all occurrences of this with a single comma,
so that the split on the delimiter works as expected.
from csv import DictReader from io import StringIO from typing import Any csv_data = StringIO(""" user_id, address.city, address.street, address.number, name, aka, contacts.name, contacts.relationship 1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Morty, Grandson 1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Morty, Grandson 1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Morty, Grandson 1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Morty, Grandson 1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Beth, Daughter 1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Beth, Daughter 1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Beth, Daughter 1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Beth, Daughter """.replace(', ', ',') ) def unflatten_json(json_dict: dict): """Unflatten a JSON dictionary object, with keys like 'a.b.c'""" result_dict = {} for k, v in json_dict.items(): *nested_parts, field_name = k.split('.') obj = result_dict for p in nested_parts: obj = obj.setdefault(p, {}) obj[field_name] = v return result_dict def main(): reader = DictReader(csv_data) flat_json: list[dict[str, Any]] = list(reader) first_obj = flat_json[0] nested_dict = unflatten_json(first_obj) print('Flat JSON: ', first_obj) print('Nested JSON: ', nested_dict) if __name__ == '__main__': main()
The output is given below:
Flat JSON: {'user_id': '1', 'address.city': 'Seattle', 'address.street': 'Atomic Street', 'address.number': '6910', 'name': 'Rick Sanchez', 'aka': 'Rick', 'contacts.name': 'Morty', 'contacts.relationship': 'Grandson'} Nested JSON: {'user_id': '1', 'address': {'city': 'Seattle', 'street': 'Atomic Street', 'number': '6910'}, 'name': 'Rick Sanchez', 'aka': 'Rick', 'contacts': {'name': 'Morty', 'relationship': 'Grandson'}}
Note, if you want to unflatten all JSON dictionary objects in the list, you could instead use a list
comprehension as below:
result_list = [unflatten_json(d) for d in flat_json]
I would also point out that the above solution isn’t perfect, as it will pass in everything as string values, for example in the case of 'user_id': '1'
. To work around that, you can modify the unflatten_json
function so it is like below:
... for k, v in json_dict.items(): ... try: v = int(v) except ValueError: pass obj[field_name] = v
Now the unflattened JSON object should be as below. Note that I’m pretty printing it with json.dumps(nested_dict, indent=2)
so it’s a little easier to see.
{ "user_id": 1, "address": { "city": "Seattle", "street": "Atomic Street", "number": 6910 }, "name": "Rick Sanchez", "aka": "Rick", "contacts": { "name": "Morty", "relationship": "Grandson" } }
Complete Solution
The full solution to achieve the desired output (data for all rows appended to aka
and contacts
) is provided below:
from csv import DictReader from io import StringIO from pprint import pprint csv_data = StringIO(""" user_id, address.city, address.street, address.number, name, aka, contacts.name, contacts.relationship 1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Morty, Grandson 1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Morty, Grandson 1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Morty, Grandson 1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Morty, Grandson 1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Beth, Daughter 1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Beth, Daughter 1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Beth, Daughter 1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Beth, Daughter """.replace(', ', ',') ) def unflatten_json(json_dict: dict[str, str]): """Unflatten a JSON dictionary object, with keys like 'a.b.c'""" result_dict = {} for k, v in json_dict.items(): *nested_parts, field_name = k.split('.') obj = result_dict for p in nested_parts: obj = obj.setdefault(p, {}) obj[field_name] = int(v) if v.isnumeric() else v return result_dict def main(): reader = DictReader(csv_data) rows = list(map(unflatten_json, reader)) # retrieve the first element in the (unflattened) sequence result_obj = rows[0] # define list fields that we want to merge data for list_fields = ('aka', 'contacts') # now loop through, and for all rows merge the data for these fields for field in list_fields: result_obj[field] = [row[field] for row in rows] print('Result object:') pprint(result_obj) if __name__ == '__main__': main()
This should have the desired result as also noted in the question:
Result object: {'address': {'city': 'Seattle', 'number': 6910, 'street': 'Atomic Street'}, 'aka': ['Rick', 'Grandpa', 'Albert Ein-douche', 'Richard', 'Rick', 'Grandpa', 'Albert Ein-douche', 'Richard'], 'contacts': [{'name': 'Morty', 'relationship': 'Grandson'}, {'name': 'Morty', 'relationship': 'Grandson'}, {'name': 'Morty', 'relationship': 'Grandson'}, {'name': 'Morty', 'relationship': 'Grandson'}, {'name': 'Beth', 'relationship': 'Daughter'}, {'name': 'Beth', 'relationship': 'Daughter'}, {'name': 'Beth', 'relationship': 'Daughter'}, {'name': 'Beth', 'relationship': 'Daughter'}], 'name': 'Rick Sanchez', 'user_id': 1}