Problem
I have a list of objects. Each object has two attributes: “score” and “coordinates”. I need to find the largest N objects of the list based on the score
attribute. The main problem I’m having is sorting the objects using only the score
attribute. Sorting can be partial. I’m only interested in the N largest objects.
Current Solution
My current approach is not the most elegant nor the most efficient. The idea is to create a dictionary
of object indices
and their score
, then sort the list of scores and use the dictionary
to index the objects that yielded the largest scores.
These are the steps:
Create a list of
scores
. Each element of the list corresponds to an object. That is, the first entry is the score of the first object, the second entry is the score of the second object, and so on.Create a
dictionary
using the object’sscores
askey
and objectindex
asvalue
.Order the score list using a
heapq
to get theN
largest objects.Use the
dictionary
to get those objects that have the largestscores
.Create a new
list
with only theN
largest-score objects.
Code Snippets
This is my sorting function:
import random import heapq # Gets the N objects with the largest score: def getLargest(N, objects): # Set output objects: outobjects = objects # Get the total of objects in list: totalobjects = len(objects) # Check if the total number of objects is bigger than the N requested # largest objects: if totalobjects > N: # Get the "score" attributes from all the objects: objectScores = [o.score for o in objects] # Create a dictionary with the index of the objects and their score. # I'm using a dictionary to keep track of the largest scores and # the objects that produced them: objectIndices = range(totalobjects) objectDictionary = dict(zip(objectIndices, objectScores)) # Get the N largest objects based on score: largestObjects = heapq.nlargest(N, objectScores) print(largestObjects) # Prepare the output list of objects: outobjects = [None] * N # Look for those objects that produced the # largest score: for k in range(N): # Get current largest object: currentLargest = largestObjects[k] # Get its original position on the keypoint list: position = objectScores.index(currentLargest) # Index the corresponding keypoint and store it # in the output list: outobjects[k] = objects[position] # Done: return outobjects
This snippet generates 100
random objects used to test my method. The final loop prints the N = 3
randomly-generated objects with the largest score
:
# Create a list with random objects: totalObjects = 100 randomObjects = [] # Test object class: class Object(object): pass # Generate a list of random objects for i in range(totalObjects): # Instance of objects: tempObject = Object() # Set the object's random score random.seed() tempObject.score = random.random() # Set the object's random coordinates: tempObject.coordinates = (random.randint(0, 5), random.randint(0, 5)) # Store object into list: randomObjects.append(tempObject) # Get the 3 largest objects sorted by score: totalLargestObjects = 3 largestObjects = getLargest(totalLargestObjects, randomObjects) # Print the filtered objects: for i in range(len(largestObjects)): # Get the current object in the list: currentObject = largestObjects[i] # Get its score: currentScore = currentObject.score # Get its coordinates as a tuple (x,y) currentCoordinates = currentObject.coordinates # Print the info: print("object: " + str(i) + " score: " + str(currentScore) + " x: " + str( currentCoordinates[0]) + " y: " + str(currentCoordinates[1]))
My current method gets the job done, but there has to be a more pythonic (more vectorized) way of achieving the same thing. My background is mostly C++, I’m still learning Python. Any feedback is welcome.
Additional info
Initially, I was looking for something similar to C++’s std:: nth_element
. It appears this functionality is somewhat provided in Python by NumPy’s partition
. Unfortunately, while std::nth_element
supports a predicate for custom ordering, NumPy’s partition
does not. I ended up using a heapq
, which does the job nicely and sorts in the desired order, but I don’t know the optimal way to sort based on one attribute.
Advertisement
Answer
Tuple is what you needed. Instead of storing scores in a heap, stores a tuple of (score, object)
in the heap. It will try to compare by the score and return a list of tuples which you can use to retrieve the original objects. This will save you extra steps to retrieve objects by their scores:
heapq.nlargest(3, ((obj.score, obj) for obj in randomObjects)) # [(0.9996643881256989, <__main__.Object object at 0x155f730>), (0.9991398955041872, <__main__.Object object at 0x119e928>), (0.9858047551444177, <__main__.Object object at 0x15e38c0>)]
For a real world example: https://akuiper.com/console/g6YuNa_1WClp
Or as @shriakhilc commented, use key
parameter in heapq.nlargest
to specify you want to compare by score:
heapq.nlargest(3, randomObjects, lambda o: o.score)