pymongo: remove duplicates (map reduce?)

Question

I have a database with several collections (~15 million documents overall), and the documents look like this (simplified): They all have a unique _id field as well, but I want to delete duplicates according to another field (the external ID field). First, I tried a very manual approach with lists and deleting after…

Accepted Answer

An alternative approach is to use the aggregation framework, which has better performance than map-reduce. Consider the following aggregation pipeline. In its first stage, the $group operator groups documents by the ID field and, using the $addToSet operator, stores each _id value of the grouped records in the unique_ids field. The $sum accumulator operator adds up the values passed to it, in this case the constant 1, thereby counting the number of grouped records into the count field. The second stage, $match, keeps only the groups with a count of at least 2, i.e. duplicates.

Once you get the result from the aggregation, you iterate the cursor, drop the first _id from each unique_ids field (that document is the one kept), and push the rest into a list that is used afterwards to remove the duplicates (minus one entry):

    cursor = db.coll.aggregate(
        [
            {"$group": {"_id": "$ID", "unique_ids": {"$addToSet": "$_id"}, "count": {"$sum": 1}}},
            {"$match": {"count": {"$gte": 2}}}
        ]
    )

    response = []
    for doc in cursor:
        # Keep the first _id of each group; every other one is a duplicate.
        del doc["unique_ids"][0]
        for _id in doc["unique_ids"]:
            response.append(_id)

    # Collection.remove() is gone in PyMongo 4; delete_many() replaces it.
    db.coll.delete_many({"_id": {"$in": response}})
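With ~15 million documents, the list of _ids passed to a single $in filter could grow very large, so one possible variation is to delete in batches. The following is only a sketch under assumptions, not part of the original answer: the helper names `ids_to_delete`, `batches` and `remove_duplicates`, as well as the `batch_size` default, are made up for illustration, and `coll` is assumed to be a PyMongo `Collection` (PyMongo 3.0 or later, where `delete_many` is available).

```python
from itertools import islice


def ids_to_delete(groups):
    """Collect every _id except the first from each duplicate group."""
    doomed = []
    for group in groups:
        # The first _id of each group is kept; the rest are duplicates.
        doomed.extend(group["unique_ids"][1:])
    return doomed


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def remove_duplicates(coll, field="ID", batch_size=10_000):
    """Delete all but one document per distinct value of `field`.

    `coll` is assumed to be a pymongo Collection; `batch_size` is an
    arbitrary illustrative default, not a tuned value.
    """
    pipeline = [
        {"$group": {"_id": "$" + field,
                    "unique_ids": {"$addToSet": "$_id"},
                    "count": {"$sum": 1}}},
        {"$match": {"count": {"$gte": 2}}},
    ]
    # allowDiskUse lets the $group stage spill to disk on large collections.
    groups = coll.aggregate(pipeline, allowDiskUse=True)
    deleted = 0
    for batch in batches(ids_to_delete(groups), batch_size):
        deleted += coll.delete_many({"_id": {"$in": batch}}).deleted_count
    return deleted
```

Calling `remove_duplicates(db.coll)` would then return the number of documents removed. Note that $addToSet does not guarantee any particular order, so which document of each group survives is arbitrary; if a specific one must be kept (say, the newest), the pipeline would need a different accumulator.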

Answer