Assume a sorted list of integers, as below:
```python
data = [1] * 3 + [4] * 5 + [5] * 2 + [9] * 3  # [1, 1, 1, 4, 4, 4, 4, 4, 5, 5, 9, 9, 9]
```
I want to find the indices where the value changes, i.e.

```
[3, 8, 10, 13]
```
One approach is to use itertools.groupby:

```python
from itertools import groupby

cursor = 0
result = []
for key, group in groupby(data):
    cursor += sum(1 for _ in group)  # length of the current run
    result.append(cursor)
print(result)
```
Output:

```
[3, 8, 10, 13]
```
This approach is O(n). Another possible approach is to use bisect.bisect_left:

```python
from bisect import bisect_left

cursor = 0
result = []
while cursor < len(data):
    # Jump to the first index whose value exceeds the current one.
    cursor = bisect_left(data, data[cursor] + 1, cursor, len(data))
    result.append(cursor)
print(result)
```
Output:

```
[3, 8, 10, 13]
```
This approach is O(k log n), where k is the number of distinct elements. A variant of this approach is to use an exponential search, as sketched below.
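For example, a minimal sketch of that variant could look like this (the helper name `change_indices_galloping` is just illustrative): the galloping phase doubles its step until it jumps past the end of the current run, and a bisect then finishes inside the bracketed window.

```python
from bisect import bisect_left

def change_indices_galloping(data):
    """Exponential-search sketch: gallop out of the current run,
    then bisect within the last bracketed window."""
    result = []
    n = len(data)
    cursor = 0
    while cursor < n:
        val = data[cursor]
        step = 1
        # Double the step until we overshoot the end of the current run.
        while cursor + step < n and data[cursor + step] == val:
            step *= 2
        # The run's end lies in (cursor + step // 2, min(cursor + step, n)].
        lo = cursor + step // 2 + 1
        hi = min(cursor + step, n)
        cursor = bisect_left(data, val + 1, lo, hi)
        result.append(cursor)
    return result

data = [1] * 3 + [4] * 5 + [5] * 2 + [9] * 3
print(change_indices_galloping(data))  # [3, 8, 10, 13]
```

Each run of length L then costs O(log L) probes instead of O(log n), which works out to roughly O(k log(n/k)) overall.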
Is there any faster or more performant way of doing this?
Answer
When it comes to asymptotic complexity, I think you can improve slightly on the repeated binary search, on average, by applying a more evenly spread divide-and-conquer approach: first pinpoint a value change that occurs near the middle of the input list. This splits the range into approximately two halves and shortens the search path of each subsequent binary search by about one step.
Yet, because this is Python, the gain might not be noticeable due to interpreter overhead (the yield and yield from statements, the recursion, …). It might even perform worse for the list sizes you work with:
```python
from bisect import bisect_left

def locate(data, start, end):
    # An empty or constant range contains no value change.
    if start >= end or data[start] == data[end - 1]:
        return
    mid = (start + end) // 2
    val = data[mid]
    if val == data[start]:
        # The middle sits inside the leftmost run, so the nearest
        # change is to the right of mid: look for the next value.
        start = mid
        val += 1
    # First index in (start, end) whose value is at least val:
    # a change point close to the middle of the range.
    i = bisect_left(data, val, start + 1, end)
    yield from locate(data, start, i)
    yield i
    yield from locate(data, i, end)

data = [1] * 3 + [4] * 5 + [5] * 2 + [9] * 3
print(*locate(data, 0, len(data)))  # 3 8 10
```
Note that this only outputs valid indices, so 13 is not included for this example input.
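If the final boundary len(data) should be included, as in the expected output from the question, one way is to append it after consuming the generator:

```python
result = [*locate(data, 0, len(data)), len(data)]
print(result)  # [3, 8, 10, 13]
```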