Suppose I have a list that goes like : ”’ [1,2,3,4,9,10,11,20] ”’ I need the result to be like : ”’ [[4,9],[11,20]] ”’ I have defined a function that goes like this :
def get_range(lst): i=0 seqrange=[] for new in lst: a=[] start=new end=new if i==0: i=1 old=new else: if new - old >1: a.append(old) a.append(new) old=new if len(a): seqrange.append(a) return seqrange
Is there any other easier and efficient way to do it? I need to do this in the range of millions.
Advertisement
Answer
You can use numpy arrays and the diff
function that comes along with them. Numpy is so much more efficient than looping when you have millions of rows.
Slight aside: Why are numpy arrays so fast? Because they are arrays of data instead of arrays of pointers to data (which is what Python lists are), because they offload a whole bunch of computations to a backend written in C, and because they leverage the SIMD paradigm to run a Single Instruction on Multiple Data simultaneously.
Now back to the problem at hand:
The diff
function gives us the difference between consecutive elements of the array. Pretty convenient, given that we need to find where this difference is greater than a known threshold
!
import numpy as np threshold = 1 arr = np.array([1,2,3,4,9,10,11,20]) deltas = np.diff(arr) # There's a gap wherever the delta is greater than our threshold gaps = deltas > threshold gap_indices = np.argwhere(gaps) gap_starts = arr[gap_indices] gap_ends = arr[gap_indices + 1] # Finally, stack the two arrays horizontally all_gaps = np.hstack((gap_starts, gap_ends)) print(all_gaps) # Output: # [[ 4 9] # [11 20]]
You can access all_gaps
like a 2D matrix: all_gaps[0, 1]
would give you 9
, for example. If you really need the answer as a list-of-lists, simply convert it like so:
all_gaps_list = all_gaps.tolist() print(all_gaps_list) # Output: [[4, 9], [11, 20]]
Comparing the runtime of the iterative method from @happydave’s answer with the numpy method:
import random import timeit import numpy def gaps1(arr, threshold): deltas = np.diff(arr) gaps = deltas > threshold gap_indices = np.argwhere(gaps) gap_starts = arr[gap_indices] gap_ends = arr[gap_indices + 1] all_gaps = np.hstack((gap_starts, gap_ends)) return all_gaps def gaps2(lst, thr): seqrange = [] for i in range(len(lst)-1): if lst[i+1] - lst[i] > thr: seqrange.append([lst[i], lst[i+1]]) return seqrange test_list = [i for i in range(100000)] for i in range(100): test_list.remove(random.randint(0, len(test_list) - 1)) test_arr = np.array(test_list) # Make sure both give the same answer: assert np.all(gaps1(test_arr, 1) == gaps2(test_list, 1)) t1 = timeit.timeit('gaps1(test_arr, 1)', setup='from __main__ import gaps1, test_arr', number=100) t2 = timeit.timeit('gaps2(test_list, 1)', setup='from __main__ import gaps2, test_list', number=100) print(f"t1 = {t1}s; t2 = {t2}s; Numpy gives ~{t2 // t1}x speedup")
On my laptop, this gives:
t1 = 0.020834800001466647s; t2 = 1.2446780000027502s; Numpy gives ~59.0x speedup
My word that’s fast!