I have a data set which I can represent by this toy example of a list of dictionaries:
data = [{
"_id" : "001",
"Location" : "NY",
"start_date" : "2022-01-01T00:00:00Z",
"Foo" : "fruits"
},
{
"_id" : "002",
"Location" : "NY",
"start_date" : "2022-01-02T00:00:00Z",
"Foo" : "fruits"
},
{
"_id" : "011",
"Location" : "NY",
"start_date" : "2022-02-01T00:00:00Z",
"Bar" : "vegetables"
},
{
"_id" : "012",
"Location" : "NY",
"Start_Date" : "2022-02-02T00:00:00Z",
"Bar" : "vegetables"
},
{
"_id" : "101",
"Location" : "NY",
"Start_Date" : "2022-03-01T00:00:00Z",
"Baz" : "pizza"
},
{
"_id" : "102",
"Location" : "NY",
"Start_Date" : "2022-03-2T00:00:00Z",
"Baz" : "pizza"
},
]
Here is an algorithm in Python which collects each of the keys in each ‘collection’ and whenever there is a key change, the algorithm adds those keys to output.
data_keys = []
for i, lst in enumerate(data):
all_keys = []
for k, v in lst.items():
all_keys.append(k)
if k.lower() == 'start_date':
start_date = v
this_coll = {'start_date': start_date, 'all_keys': all_keys}
if i == 0:
data_keys.append(this_coll)
else:
last_coll = data_keys[-1]
if this_coll['all_keys'] == last_coll['all_keys']:
continue
else:
data_keys.append(this_coll)
The correct output given here records each change of field name: Foo
, Bar
, Baz
as well as the change of case in field start_date
to Start_Date
:
[{'start_date': '2022-01-01T00:00:00Z',
'all_keys': ['_id', 'Location', 'start_date', 'Foo']},
{'start_date': '2022-02-01T00:00:00Z',
'all_keys': ['_id', 'Location', 'start_date', 'Bar']},
{'start_date': '2022-02-02T00:00:00Z',
'all_keys': ['_id', 'Location', 'Start_Date', 'Bar']},
{'start_date': '2022-03-01T00:00:00Z',
'all_keys': ['_id', 'Location', 'Start_Date', 'Baz']}]
Is there a general algorithm which covers this pattern comparing current to previous item in a stack?
I need to generalize this algorithm and find a solution to do exactly the same thing with MongoDB documents in a collection. In order for me to discover if Mongo has an Aggregation Pipeline Operator which I could use, I must first understand if this basic algorithm has other common forms so I know what to look for.
Or someone who knows MongoDB aggregation pipelines really well could suggest operators which would produce the desired result?
Advertisement
Answer
EDIT: If you want to use a query for this, one option is something like:
- The
$objectToArray
allow to format the keys as values, and the$ifNull
allows to check several options ofstart_date
. - The
$unwind
allows us to sort the keys. - The
$group
allow us to undo the$unwind
, but now with sorted keys $reduce
to create a string from all keys, so we’ll have something to compare.- group again, but now with our string, so we’ll only have documents for changes.
db.collection.aggregate([
{
$project: {
data: {$objectToArray: "$$ROOT"},
start_date: {$ifNull: ["$start_date", "$Start_Date"]}
}
},
{$unwind: "$data"},
{$project: {start_date: 1, key: "$data.k", _id: 0}},
{$sort: {start_date: 1, key: 1}},
{$group: {_id: "$start_date", all_keys: {$push: "$key"}}},
{
$project: {
all_keys: 1,
all_keys_string: {
$reduce: {
input: "$all_keys",
initialValue: "",
in: {$concat: ["$$value", "$$this"]}
}
}
}
},
{
$group: {
_id: "$all_keys_string",
all_keys: {$first: "$all_keys"},
start_date: {$first: "$_id"}
}
},
{$unset: "_id"}
])