Iterable objects are those that implement
__iter__ function, which returns an iterator object, i.e. and object providing the functions
__next__ and behaving correctly. Usually the size of the iterable object is not known beforehand, and iterable object is not expected to know how long the iteration will last; however, there are some cases in which knowing the length of the iterable is valuable, for example, when creating an array.
list(x for x in range(1000000)), for example, creates an initial array of small size, copies it after it is full, and repeats for many times as explained here. Of course, it is not that important in this example, but it explains the point.
Is there a protocol in use for those iterable objects who know their length beforehand? That is, is there a protocol extending Sized and Iterable but not Collection or Reversible? It seems like there is no such protocol in language features, is there such a protocol for well-known third-party libraries? How this discussion relates to generators?
It sounds like you’re asking about something like
__length_hint__. Excerpts from PEP 424 – A method for exposing a length hint:
CPython currently defines a
__length_hint__method on several types, such as various iterators. This method is then used by various other functions (such as
list) to presize lists based on the estimate returned by
__length_hint__. Types which are not sized, and thus should not define
__len__, can then define
__length_hint__, to allow estimating or computing a size (such as many iterators).
Being able to pre-allocate lists based on the expected size, as estimated by
__length_hint__, can be a significant optimization. CPython has been observed to run some code faster than PyPy, purely because of this optimization being present.
range iterators support this (Try it online!):
it = iter(range(1000)) print(it.__length_hint__()) # prints 1000 next(it) print(it.__length_hint__()) # prints 999
list iterators even take list length changes into account (Try it online!):
a = [None] * 10 it = iter(a) print(it.__length_hint__()) # prints 10 next(it) print(it.__length_hint__()) # prints 9 a.pop() print(it.__length_hint__()) # prints 8 a.append(None) print(it.__length_hint__()) # prints 9
Generator iterators don’t support it, but you can support it in other iterators you write. Here’s a demo iterator that…
- Produces 10,000 elements.
- Hints at having 5,000 elements.
- After every 1,000 elements it shows the memory size of the list being built.
import gc beacon = object() class MyIterator: def __init__(self): self.n = 10_000 def __iter__(self): return self def __length_hint__(self): print('__length_hint__ called') return 5_000 def __next__(self): if self.n == 0: raise StopIteration self.n -= 1 if self.n % 1_000 == 0: for obj in gc.get_objects(): if isinstance(obj, list) and obj and obj is beacon: print(obj.__sizeof__()) return beacon list(MyIterator())
Output (Try it online!):
__length_hint__ called 45088 45088 45088 45088 45088 50776 57168 64360 72456 81560
We see that
list asks for a length hint and from the start pre-allocates enough memory for 5,000 references of 8 bytes each, plus 12.5% overallocation. After the first 5,000 elements, it doesn’t ask for length hints anymore, and keeps increasing its size bit by bit.
__length_hint__ instead accurately returns 10,000, then
list instead pre-allocates
90088 bytes and that remains until the end.