Skip to content
Advertisement

Generalize algorithm for a loop comparing to last record?

I have a data set which I can represent by this toy example of a list of dictionaries:

data = [{
        "_id" : "001",
        "Location" : "NY",
        "start_date" : "2022-01-01T00:00:00Z",
        "Foo" : "fruits"
    },
        {
        "_id" : "002",
        "Location" : "NY",
        "start_date" : "2022-01-02T00:00:00Z",
        "Foo" : "fruits"
    },
    {
        "_id" : "011",
        "Location" : "NY",
        "start_date" : "2022-02-01T00:00:00Z",
        "Bar" : "vegetables"
    },
        {
        "_id" : "012",
        "Location" : "NY",
        "Start_Date" : "2022-02-02T00:00:00Z",
        "Bar" : "vegetables"
    },
    {
        "_id" : "101",
        "Location" : "NY",
        "Start_Date" : "2022-03-01T00:00:00Z",
        "Baz" : "pizza"
    },
        {
        "_id" : "102",
        "Location" : "NY",
        "Start_Date" : "2022-03-2T00:00:00Z",
        "Baz" : "pizza"
    },
]

Here is an algorithm in Python which collects each of the keys in each ‘collection’ and whenever there is a key change, the algorithm adds those keys to output.

data_keys = []
for i, lst in enumerate(data):
    all_keys = []
    for k, v in lst.items():
        all_keys.append(k)
        if k.lower() == 'start_date':
            start_date = v
    this_coll = {'start_date': start_date, 'all_keys': all_keys}
    if i == 0:
        data_keys.append(this_coll)
    else:
        last_coll = data_keys[-1]
        if this_coll['all_keys'] == last_coll['all_keys']:
            continue
        else:
            data_keys.append(this_coll)

The correct output given here records each change of field name: Foo, Bar, Baz as well as the change of case in field start_date to Start_Date:

[{'start_date': '2022-01-01T00:00:00Z',
  'all_keys': ['_id', 'Location', 'start_date', 'Foo']},
 {'start_date': '2022-02-01T00:00:00Z',
  'all_keys': ['_id', 'Location', 'start_date', 'Bar']},
 {'start_date': '2022-02-02T00:00:00Z',
  'all_keys': ['_id', 'Location', 'Start_Date', 'Bar']},
 {'start_date': '2022-03-01T00:00:00Z',
  'all_keys': ['_id', 'Location', 'Start_Date', 'Baz']}]

Is there a general algorithm which covers this pattern comparing current to previous item in a stack?

I need to generalize this algorithm and find a solution to do exactly the same thing with MongoDB documents in a collection. In order for me to discover if Mongo has an Aggregation Pipeline Operator which I could use, I must first understand if this basic algorithm has other common forms so I know what to look for.

Or someone who knows MongoDB aggregation pipelines really well could suggest operators which would produce the desired result?

Advertisement

Answer

EDIT: If you want to use a query for this, one option is something like:

  1. The $objectToArray allow to format the keys as values, and the $ifNull allows to check several options of start_date.
  2. The $unwind allows us to sort the keys.
  3. The $group allow us to undo the $unwind, but now with sorted keys
  4. $reduce to create a string from all keys, so we’ll have something to compare.
  5. group again, but now with our string, so we’ll only have documents for changes.
db.collection.aggregate([
  {
    $project: {
      data: {$objectToArray: "$$ROOT"},
      start_date: {$ifNull: ["$start_date", "$Start_Date"]}
    }
  },
  {$unwind: "$data"},
  {$project: {start_date: 1, key: "$data.k", _id: 0}},
  {$sort: {start_date: 1,  key: 1}},
  {$group: {_id: "$start_date", all_keys: {$push: "$key"}}},
  {
    $project: {
      all_keys: 1,
      all_keys_string: {
        $reduce: {
          input: "$all_keys",
          initialValue: "",
          in: {$concat: ["$$value", "$$this"]}
        }
      }
    }
  },
  {
    $group: {
      _id: "$all_keys_string",
      all_keys: {$first: "$all_keys"},
      start_date: {$first: "$_id"}
    }
  },
  {$unset: "_id"}
])

Playground example

User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement