I have a simple book catalog dictionary as the following
{ 'key': { 'title': str, 'authors': [ { 'firstname': str, 'lastname': str } ], 'tags': [ str ], 'blob': str } }
Each book
is a string key in the dictionary. A book contains a single title
, and possibly has many authors
(often just one). An author
is made of two strings, firstname
and lastname
. Also we can associate many tags
to a book as novel, literature, art, 1900s, etc. Each book as a blob
field that contains additional data. (often the book itself). I want to be able to search for a given entry (or a group of them) based on data, as by author, by tag.
My main workflow would be:
- Given a query, return all blob fields associated to each entry.
My question is how to model this, which libraries or formats to use keeping the given constraints:
- Minimize the number of data objects (preference for a single data object to simplify queries).
- Small size of columns (create a new column for every possible tag is probably insane and lead to a very sparse dataset)
- Do not duplicate
blob
field (since it can be large).
My first idea was to create multiple rows for each author, for example:
{ '123': { 'title': 'A sample book', 'authors': [ {'firstname': 'John', 'lastname': 'Smith'}, {'firstname': 'Foos', 'lastname': 'M. Bar'} ] 'tags': [ 'tag1', 'tag2', 'tag3' ], 'blob': '.....' }
Would turn, initially into two entries as
idx | key | Title | authors_firstname | authors_lastname | tags | blob |
---|---|---|---|---|---|---|
0 | 123 | Sample Book | John | Smith | [‘tag1’, ‘tag2’, ‘tag3’] | … |
1 | 123 | Sample Book | Foos | M. Bar | [‘tag1’, ‘tag2’, ‘tag3’] | … |
But this still duplicates the blob, and still need to figure out what to do with the unknown number of tags (as the database grows).
Advertisement
Answer
You can use TinyDB
to accomplish what you want.
First, convert your dict to a database:
from tinydb import TinyDB, Query from tinydb.table import Document data = [{'123': {'title': 'A sample book', 'authors': [{'firstname': 'John', 'lastname': 'Smith'}, {'firstname': 'Foos', 'lastname': 'M. Bar'}], 'tags': ['tag1', 'tag2', 'tag3'], 'blob': 'blob1'}}, {'456': {'title': 'Another book', 'authors': [{'firstname': 'Paul', 'lastname': 'Roben'}], 'tags': ['tag1', 'tag3', 'tag4'], 'blob': 'blob2'}}] db = TinyDB('catalog.json') for record in data: db.insert(Document(list(record.values())[0], doc_id=list(record.keys())[0]))
Now you can make queries:
Book = Query() Author = Query() rows = db.search(Book.authors.any(Author.lastname == 'Smith')) rows = db.search(Book.tags.all(['tag1', 'tag4'])) rows = db.all()
Given a query, return all blob fields associated to each entry.
blobs = {row.doc_id: row['blob'] for row in db.all()}
>>> blobs {123: 'blob1', 456: 'blob2'}