Skip to content
Advertisement

Modeling a dictionary as a queryable data object in python

I have a simple book catalog dictionary as the following

{ 
  'key':
  {
  'title': str,
  'authors': [ {
                 'firstname': str,
                 'lastname': str 
               }
             ],
  'tags': [ str ],
  'blob': str
  }
}

Each book is a string key in the dictionary. A book contains a single title, and possibly has many authors (often just one). An author is made of two strings, firstname and lastname. Also we can associate many tags to a book as novel, literature, art, 1900s, etc. Each book as a blob field that contains additional data. (often the book itself). I want to be able to search for a given entry (or a group of them) based on data, as by author, by tag.

My main workflow would be:

  1. Given a query, return all blob fields associated to each entry.

My question is how to model this, which libraries or formats to use keeping the given constraints:

  • Minimize the number of data objects (preference for a single data object to simplify queries).
  • Small size of columns (create a new column for every possible tag is probably insane and lead to a very sparse dataset)
  • Do not duplicate blob field (since it can be large).

My first idea was to create multiple rows for each author, for example:

{ '123': { 'title': 'A sample book',
           'authors': [ {'firstname': 'John', 'lastname': 'Smith'},
                        {'firstname': 'Foos', 'lastname': 'M. Bar'} ]
           'tags': [ 'tag1', 'tag2', 'tag3' ],
           'blob': '.....'
} 

Would turn, initially into two entries as

idx key Title authors_firstname authors_lastname tags blob
0 123 Sample Book John Smith [‘tag1’, ‘tag2’, ‘tag3’]
1 123 Sample Book Foos M. Bar [‘tag1’, ‘tag2’, ‘tag3’]

But this still duplicates the blob, and still need to figure out what to do with the unknown number of tags (as the database grows).

Advertisement

Answer

You can use TinyDB to accomplish what you want.

First, convert your dict to a database:

from tinydb import TinyDB, Query
from tinydb.table import Document

data = [{'123': {'title': 'A sample book',
                 'authors': [{'firstname': 'John', 'lastname': 'Smith'},
                             {'firstname': 'Foos', 'lastname': 'M. Bar'}],
                 'tags': ['tag1', 'tag2', 'tag3'],
                 'blob': 'blob1'}},
        {'456': {'title': 'Another book',
                 'authors': [{'firstname': 'Paul', 'lastname': 'Roben'}],
                 'tags': ['tag1', 'tag3', 'tag4'],
                 'blob': 'blob2'}}]

db = TinyDB('catalog.json')
for record in data:
    db.insert(Document(list(record.values())[0], doc_id=list(record.keys())[0]))

Now you can make queries:

Book = Query()
Author = Query()

rows = db.search(Book.authors.any(Author.lastname == 'Smith'))
rows = db.search(Book.tags.all(['tag1', 'tag4']))
rows = db.all()

Given a query, return all blob fields associated to each entry.

blobs = {row.doc_id: row['blob'] for row in db.all()}
>>> blobs
{123: 'blob1', 456: 'blob2'}
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement