Skip to content
Advertisement

Any easy way to transform a missing number sequence to its range?

Suppose I have a list that goes like : ”’ [1,2,3,4,9,10,11,20] ”’ I need the result to be like : ”’ [[4,9],[11,20]] ”’ I have defined a function that goes like this :

def get_range(lst):
i=0
seqrange=[]
for new in lst:
    a=[]
    start=new
    end=new
    if i==0:
        i=1
        old=new
    else:
        if new - old >1:
            a.append(old)
            a.append(new)
    old=new
    if len(a):
        seqrange.append(a)
return seqrange

Is there any other easier and efficient way to do it? I need to do this in the range of millions.

Advertisement

Answer

You can use numpy arrays and the diff function that comes along with them. Numpy is so much more efficient than looping when you have millions of rows.


Slight aside: Why are numpy arrays so fast? Because they are arrays of data instead of arrays of pointers to data (which is what Python lists are), because they offload a whole bunch of computations to a backend written in C, and because they leverage the SIMD paradigm to run a Single Instruction on Multiple Data simultaneously.


Now back to the problem at hand:

The diff function gives us the difference between consecutive elements of the array. Pretty convenient, given that we need to find where this difference is greater than a known threshold!

import numpy as np

threshold = 1
arr = np.array([1,2,3,4,9,10,11,20])

deltas = np.diff(arr)
# There's a gap wherever the delta is greater than our threshold
gaps = deltas > threshold 
gap_indices = np.argwhere(gaps)

gap_starts = arr[gap_indices]
gap_ends = arr[gap_indices + 1] 

# Finally, stack the two arrays horizontally
all_gaps = np.hstack((gap_starts, gap_ends))
print(all_gaps)
# Output: 
# [[ 4  9]
#  [11 20]]

You can access all_gaps like a 2D matrix: all_gaps[0, 1] would give you 9, for example. If you really need the answer as a list-of-lists, simply convert it like so:

all_gaps_list = all_gaps.tolist()
print(all_gaps_list)
# Output: [[4, 9], [11, 20]]

Comparing the runtime of the iterative method from @happydave’s answer with the numpy method:

import random
import timeit

import numpy

def gaps1(arr, threshold):
    deltas = np.diff(arr)
    gaps = deltas > threshold 
    gap_indices = np.argwhere(gaps)
    gap_starts = arr[gap_indices]
    gap_ends = arr[gap_indices + 1] 
    all_gaps = np.hstack((gap_starts, gap_ends))
    return all_gaps

def gaps2(lst, thr):
    seqrange = []
    for i in range(len(lst)-1):
      if lst[i+1] - lst[i] > thr:
        seqrange.append([lst[i], lst[i+1]])
    return seqrange

test_list = [i for i in range(100000)]
for i in range(100):
    test_list.remove(random.randint(0, len(test_list) - 1))

test_arr = np.array(test_list)

# Make sure both give the same answer:
assert np.all(gaps1(test_arr, 1) == gaps2(test_list, 1))

t1 = timeit.timeit('gaps1(test_arr, 1)', setup='from __main__ import gaps1, test_arr', number=100)
t2 = timeit.timeit('gaps2(test_list, 1)', setup='from __main__ import gaps2, test_list', number=100)

print(f"t1 = {t1}s; t2 = {t2}s; Numpy gives ~{t2 // t1}x speedup")

On my laptop, this gives:

t1 = 0.020834800001466647s; t2 = 1.2446780000027502s; Numpy gives ~59.0x speedup

My word that’s fast!

Advertisement