Skip to content
Advertisement

Iterables / generators with specified length

Iterable objects are those that implement __iter__ function, which returns an iterator object, i.e. and object providing the functions __iter__ and __next__ and behaving correctly. Usually the size of the iterable object is not known beforehand, and iterable object is not expected to know how long the iteration will last; however, there are some cases in which knowing the length of the iterable is valuable, for example, when creating an array. list(x for x in range(1000000)), for example, creates an initial array of small size, copies it after it is full, and repeats for many times as explained here. Of course, it is not that important in this example, but it explains the point.

Is there a protocol in use for those iterable objects who know their length beforehand? That is, is there a protocol extending Sized and Iterable but not Collection or Reversible? It seems like there is no such protocol in language features, is there such a protocol for well-known third-party libraries? How this discussion relates to generators?

Advertisement

Answer

It sounds like you’re asking about something like __length_hint__. Excerpts from PEP 424 – A method for exposing a length hint:

CPython currently defines a __length_hint__ method on several types, such as various iterators. This method is then used by various other functions (such as list) to presize lists based on the estimate returned by __length_hint__. Types which are not sized, and thus should not define __len__, can then define __length_hint__, to allow estimating or computing a size (such as many iterators).

Being able to pre-allocate lists based on the expected size, as estimated by __length_hint__, can be a significant optimization. CPython has been observed to run some code faster than PyPy, purely because of this optimization being present.

For example, range iterators support this (Try it online!):

it = iter(range(1000))
print(it.__length_hint__())     # prints 1000
next(it)
print(it.__length_hint__())     # prints 999

And list iterators even take list length changes into account (Try it online!):

a = [None] * 10
it = iter(a)
print(it.__length_hint__())     # prints 10
next(it)
print(it.__length_hint__())     # prints 9
a.pop()
print(it.__length_hint__())     # prints 8
a.append(None)
print(it.__length_hint__())     # prints 9

Generator iterators don’t support it, but you can support it in other iterators you write. Here’s a demo iterator that…

  • Produces 10,000 elements.
  • Hints at having 5,000 elements.
  • After every 1,000 elements it shows the memory size of the list being built.
import gc

beacon = object()

class MyIterator:
    def __init__(self):
        self.n = 10_000
    def __iter__(self):
        return self
    def __length_hint__(self):
        print('__length_hint__ called')
        return 5_000
    def __next__(self):
        if self.n == 0:
            raise StopIteration
        self.n -= 1
        if self.n % 1_000 == 0:
            for obj in gc.get_objects():
                if isinstance(obj, list) and obj and obj[0] is beacon:
                    print(obj.__sizeof__())
        return beacon

list(MyIterator())

Output (Try it online!):

__length_hint__ called
45088
45088
45088
45088
45088
50776
57168
64360
72456
81560

We see that list asks for a length hint and from the start pre-allocates enough memory for 5,000 references of 8 bytes each, plus 12.5% overallocation. After the first 5,000 elements, it doesn’t ask for length hints anymore, and keeps increasing its size bit by bit.

If my __length_hint__ instead accurately returns 10,000, then list instead pre-allocates 90088 bytes and that remains until the end.

User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement