I have a big list, which may carry thousands to millions of entries. I set a window of finite size to slide over the list. I need to count the matched elements in the windows and repeat the procedure by sliding the window 1 position forward at a time. Here is a simple example of a list
L = [1 2 1 3 4 5 1 2 1 2 2 2 3 ]
Assuming the window is of the length of 3, it will capture
- [1 2 1] which contains one pair of matching elements (1 & 1)
- move the windows forward by 1 position => [2 1 3], no matching elements
- move the windows forward by 1 position => [1 3 4], no matching elements
- move the windows forward by 1 position => [3 4 5], no matching elements
- move the windows forward by 1 position => [4 5 1], no matching elements
- move the windows forward by 1 position => [5 1 2], no matching elements
- move the windows forward by 1 position => [1 2 1], 1 matching elements (1 & 1)
- move the windows forward by 1 position => [2 1 2], 1 matching elements (2 & 2)
- move the windows forward by 1 position => [1 2 2], 1 matching elements (2 & 2)
- move the windows forward by 1 position => [2 2 2], 3 matching elements ([2 2 -], [2 – 2], [- 2 2])
- move the windows forward by 1 position => [2 2 3], 1 matching elements (2 & 2)
So total 1 + 1 + 1 + 1 + 3 + 1 = 8 matching pairs. I found the idea to use itertools to find the combination of all elements in a window and develop a code to find all the matching pairs
import itertools L = [1,2,1,3,4,5,1,2,1,2,2,2,3] winlen = 3 totalMatch = 0 for n in range(len(L)-winlen+1): window = [L[n+i] for i in range(winlen)] A = list(itertools.combinations(window, 2)) match = [a==b for a, b in A] totalMatch += sum(match)
it works for a short list but for the list and the windows getting large, this code is so slow. I have been working with C++ for years and decide to switch to python, I will appreciate it if there is any hint to improve the efficiency of the code.
Advertisement
Answer
Keep track of the data in your window more efficiently? This is O(|L|) instead of your O(|L|*winlen^2). It keeps the window’s element counts in ctr
and the window’s matches in match
. For example, when a new value enters the window and there are already two instances of that value in the window, you get two new matches. Similarly for a value falling out of the window, it takes as many matches with it as there are other instances of it in the window.
from collections import Counter L = [1,2,1,3,4,5,1,2,1,2,2,2,3] winlen = 3 totalMatch = match = 0 ctr = Counter() for i, x in enumerate(L): # Remove old element falling out of window if i >= winlen: ctr[L[i-winlen]] -= 1 match -= ctr[L[i-winlen]] # Add new element to window match += ctr[x] ctr[x] += 1 # Update the total (for complete windows) if i >= winlen - 1: totalMatch += match print(totalMatch)
Benchmark results with L
and winlen
multiplied by 20:
38.75 ms original 0.18 ms Manuel 38.73 ms original 0.19 ms Manuel 38.87 ms original 0.18 ms Manuel
Benchmark code (also includes testing code for all lists of numbers 1 to 3 of lengths 0 to 9):
from timeit import repeat import itertools from itertools import product from collections import Counter def original(L, winlen): totalMatch = 0 for n in range(len(L)-winlen+1): window = [L[n+i] for i in range(winlen)] A = list(itertools.combinations(window, 2)) match = [a==b for a, b in A] totalMatch += sum(match) return totalMatch def Manuel(L, winlen): totalMatch = match = 0 ctr = Counter() for i, x in enumerate(L): if i >= winlen: ctr[L[i-winlen]] -= 1 match -= ctr[L[i-winlen]] match += ctr[x] ctr[x] += 1 if i >= winlen - 1: totalMatch += match return totalMatch def test(): for n in range(10): print('testing', n) for T in product([1, 2, 3], repeat=n): L = list(T) winlen = 3 expect = original(L, winlen) result = Manuel(L, winlen) assert result == expect, (L, expect, result) def bench(): L = [1,2,1,3,4,5,1,2,1,2,2,2,3] * 20 winlen = 3 * 20 for _ in range(3): for func in original, Manuel: t = min(repeat(lambda: func(L, winlen), number=1)) print('%6.2f ms ' % (t * 1e3), func.__name__) print() test() bench()