I have a data set which I can represent by this toy example of a list of dictionaries:
data = [{ "_id" : "001", "Location" : "NY", "start_date" : "2022-01-01T00:00:00Z", "Foo" : "fruits" }, { "_id" : "002", "Location" : "NY", "start_date" : "2022-01-02T00:00:00Z", "Foo" : "fruits" }, { "_id" : "011", "Location" : "NY", "start_date" : "2022-02-01T00:00:00Z", "Bar" : "vegetables" }, { "_id" : "012", "Location" : "NY", "Start_Date" : "2022-02-02T00:00:00Z", "Bar" : "vegetables" }, { "_id" : "101", "Location" : "NY", "Start_Date" : "2022-03-01T00:00:00Z", "Baz" : "pizza" }, { "_id" : "102", "Location" : "NY", "Start_Date" : "2022-03-2T00:00:00Z", "Baz" : "pizza" }, ]
Here is an algorithm in Python which collects each of the keys in each ‘collection’ and whenever there is a key change, the algorithm adds those keys to output.
data_keys = [] for i, lst in enumerate(data): all_keys = [] for k, v in lst.items(): all_keys.append(k) if k.lower() == 'start_date': start_date = v this_coll = {'start_date': start_date, 'all_keys': all_keys} if i == 0: data_keys.append(this_coll) else: last_coll = data_keys[-1] if this_coll['all_keys'] == last_coll['all_keys']: continue else: data_keys.append(this_coll)
The correct output given here records each change of field name: Foo
, Bar
, Baz
as well as the change of case in field start_date
to Start_Date
:
[{'start_date': '2022-01-01T00:00:00Z', 'all_keys': ['_id', 'Location', 'start_date', 'Foo']}, {'start_date': '2022-02-01T00:00:00Z', 'all_keys': ['_id', 'Location', 'start_date', 'Bar']}, {'start_date': '2022-02-02T00:00:00Z', 'all_keys': ['_id', 'Location', 'Start_Date', 'Bar']}, {'start_date': '2022-03-01T00:00:00Z', 'all_keys': ['_id', 'Location', 'Start_Date', 'Baz']}]
Is there a general algorithm which covers this pattern comparing current to previous item in a stack?
I need to generalize this algorithm and find a solution to do exactly the same thing with MongoDB documents in a collection. In order for me to discover if Mongo has an Aggregation Pipeline Operator which I could use, I must first understand if this basic algorithm has other common forms so I know what to look for.
Or someone who knows MongoDB aggregation pipelines really well could suggest operators which would produce the desired result?
Advertisement
Answer
EDIT: If you want to use a query for this, one option is something like:
- The
$objectToArray
allow to format the keys as values, and the$ifNull
allows to check several options ofstart_date
. - The
$unwind
allows us to sort the keys. - The
$group
allow us to undo the$unwind
, but now with sorted keys $reduce
to create a string from all keys, so we’ll have something to compare.- group again, but now with our string, so we’ll only have documents for changes.
db.collection.aggregate([ { $project: { data: {$objectToArray: "$$ROOT"}, start_date: {$ifNull: ["$start_date", "$Start_Date"]} } }, {$unwind: "$data"}, {$project: {start_date: 1, key: "$data.k", _id: 0}}, {$sort: {start_date: 1, key: 1}}, {$group: {_id: "$start_date", all_keys: {$push: "$key"}}}, { $project: { all_keys: 1, all_keys_string: { $reduce: { input: "$all_keys", initialValue: "", in: {$concat: ["$$value", "$$this"]} } } } }, { $group: { _id: "$all_keys_string", all_keys: {$first: "$all_keys"}, start_date: {$first: "$_id"} } }, {$unset: "_id"} ])