Any easy way to transform a missing number sequence to its range?

Question

Suppose I have a list that goes like : &#8221;&#8217; [1,2,3,4,9,10,11,20] &#8221;&#8217; I need the result to be like : &#8221;&#8217; [[4,9],[11,20]] &#8221;&#8217; I have defined a function that goes like this : Is there any other easier and efficient way to do it? I need to do this in the range of million…

Accepted Answer

You can use numpy arrays and the diff function that comes along with them. Numpy is so much more efficient than looping when you have millions of rows.Slight aside:Why are numpy arrays so fast? Because they are arrays of data instead of arrays of pointers to data (which is what Python lists are), because they offload a whole bunch of computations to a backend written in C, and because they leverage the SIMD paradigm to run a Single Instruction on Multiple Data simultaneously.Now back to the problem at hand:The diff function gives us the difference between consecutive elements of the array. Pretty convenient, given that we need to find where this difference is greater than a known threshold!import numpy as npthreshold = 1arr = np.array([1,2,3,4,9,10,11,20])deltas = np.diff(arr)# There's a gap wherever the delta is greater than our thresholdgaps = deltas > threshold gap_indices = np.argwhere(gaps)gap_starts = arr[gap_indices]gap_ends = arr[gap_indices + 1] # Finally, stack the two arrays horizontallyall_gaps = np.hstack((gap_starts, gap_ends))print(all_gaps)# Output: # [[ 4  9]#  [11 20]]You can access all_gaps like a 2D matrix: all_gaps[0, 1] would give you 9, for example. If you really need the answer as a list-of-lists, simply convert it like so:all_gaps_list = all_gaps.tolist()print(all_gaps_list)# Output: [[4, 9], [11, 20]]Comparing the runtime of the iterative method from @happydave&#8217;s answer with the numpy method:import randomimport timeitimport numpydef gaps1(arr, threshold):    deltas = np.diff(arr)    gaps = deltas > threshold     gap_indices = np.argwhere(gaps)    gap_starts = arr[gap_indices]    gap_ends = arr[gap_indices + 1]     all_gaps = np.hstack((gap_starts, gap_ends))    return all_gapsdef gaps2(lst, thr):    seqrange = []    for i in range(len(lst)-1):      if lst[i+1] - lst[i] > thr:        seqrange.append([lst[i], lst[i+1]])    return seqrangetest_list = [i for i in range(100000)]for i in range(100):    test_list.remove(random.randint(0, len(test_list) - 1))test_arr = np.array(test_list)# Make sure both give the same answer:assert np.all(gaps1(test_arr, 1) == gaps2(test_list, 1))t1 = timeit.timeit('gaps1(test_arr, 1)', setup='from __main__ import gaps1, test_arr', number=100)t2 = timeit.timeit('gaps2(test_list, 1)', setup='from __main__ import gaps2, test_list', number=100)print(f"t1 = {t1}s; t2 = {t2}s; Numpy gives ~{t2 // t1}x speedup")On my laptop, this gives:t1 = 0.020834800001466647s; t2 = 1.2446780000027502s; Numpy gives ~59.0x speedupMy word that&#8217;s fast!

Advertisement

Answer