Note: this is almost a duplicate of Numpy vectorization: Find intersection between list and list of lists.
Differences:
- I am focused on efficiency when the lists are large.
- I'm searching for the largest intersections.
```
x = [500 numbers between 1 and N]
y = [[1, 2, 3], [4, 5, 6, 7], [8, 9], [10, 11, 12], ...]  # up to N
```
Here are some assumptions:

- y is a list of ~500,000 sublists of ~500 elements each.
- Each sublist in y is a range, so y is fully characterized by the last element of each sublist. In the example: 3, 7, 9, 12, ... (illustrated right after this list).
- x is not sorted.
- y contains each number between 1 and ~500,000 * 500 exactly once.
- y is sorted in the sense that, as in the example, each sublist is sorted and the first element of a sublist is the successor of the last element of the previous sublist.
- y is known long before the run, even before compile time.
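Since every sublist is a contiguous range, the whole of y can indeed be summarized by those last elements. A minimal illustration (the `y_last` name is mine):

```python
# Toy version of y from the example above.
y = [[1, 2, 3], [4, 5, 6, 7], [8, 9], [10, 11, 12]]

# Because each sublist is a contiguous range, keeping only the last
# element of each sublist loses no information.
y_last = [sublist[-1] for sublist in y]
print(y_last)  # [3, 7, 9, 12]
```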
My purpose is to know which sublists of y have at least 10 elements in common with x.
I can obviously make a loop:

```python
def find_best(x, y):
    result = []
    for index, sublist in enumerate(y):
        intersection = set(x).intersection(sublist)
        if len(intersection) >= 2:  # in real life: >= 10
            result.append(index)
    return result

x = [1, 2, 3, 4, 5, 6]
y = [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]

res = find_best(x, y)
print(res)  # [0, 2]
```
Here the result is [0, 2] because the first and third sublists of y each share at least 2 elements with x.
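A first, easy optimization (a sketch; the helper name and the `threshold` parameter are mine): build the set of x once outside the loop instead of rebuilding it for every sublist.

```python
def find_best_setonce(x, y, threshold=2):
    # Convert x to a set a single time instead of once per sublist.
    x_set = set(x)
    result = []
    for index, sublist in enumerate(y):
        if len(x_set.intersection(sublist)) >= threshold:
            result.append(index)
    return result

print(find_best_setonce([1, 2, 3, 4, 5, 6],
                        [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]))  # [0, 2]
```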
Another method would be to parse y only once and count the intersections:

```python
def find_intersec2(x, y):
    n_sublists = len(y)
    res = {list_no: 0 for list_no in range(n_sublists)}
    for list_no, sublist in enumerate(y):
        for num in sublist:
            if num in x:
                x.remove(num)
                res[list_no] += 1
    return [n for n in range(n_sublists) if res[n] >= 2]
```
This second method makes more use of the assumptions above.
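A variant of the same idea (a sketch; the helper name, the set conversion, and the `threshold` parameter are mine): keeping the elements of x in a set makes each membership test O(1) instead of O(len(x)), and avoids mutating the caller's x.

```python
def find_intersec2_set(x, y, threshold=2):
    remaining = set(x)  # O(1) membership tests, x itself is left untouched
    counts = [0] * len(y)
    for list_no, sublist in enumerate(y):
        for num in sublist:
            if num in remaining:
                remaining.remove(num)
                counts[list_no] += 1
    return [n for n, c in enumerate(counts) if c >= threshold]

print(find_intersec2_set([1, 2, 3, 4, 5, 6],
                         [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]))  # [0, 2]
```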
Questions:

- What optimizations are possible?
- Is there a completely different approach? Indexing, kd-tree? In my use case, the large list y is known days before the actual run, so I'm not afraid of building an index or whatever from y (see the rough sketch after this list). The small list x is only known at runtime.
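To make the indexing idea concrete, here is a rough sketch (hypothetical helper names, assuming the numbers in y are exactly 1..M): precompute an array mapping every number to the index of the sublist that contains it, so the query becomes one O(1) lookup per element of x. The cost is one int32 per number in y, i.e. roughly 1 GB for ~2.5 * 10^8 numbers.

```python
import numpy as np

def build_owner_index(y):
    # owner[num] = index of the sublist of y containing num
    # (assumes the sublists cover 1..M contiguously, as stated above).
    m = y[-1][-1]
    owner = np.empty(m + 1, dtype=np.int32)  # index 0 is unused
    for list_no, sublist in enumerate(y):
        owner[sublist[0]:sublist[-1] + 1] = list_no
    return owner

def query(owner, x, n_sublists, threshold=2):
    # One lookup per element of x, then a per-sublist count.
    counts = np.bincount(owner[np.asarray(x)], minlength=n_sublists)
    return np.nonzero(counts >= threshold)[0].tolist()

y = [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]
owner = build_owner_index(y)
print(query(owner, [1, 2, 3, 4, 5, 6], len(y)))  # [0, 2]
```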
Answer
Since y contains disjoint ranges whose union is itself a range, a very fast solution is to first perform a binary search on y and then count the resulting indices, returning only the ones that appear at least 10 times. The complexity of this algorithm is O(Nx log Ny), with Nx and Ny the number of items in x and y respectively. This algorithm is nearly optimal (since x needs to be read entirely).
Actual implementation
First of all, you need to transform your current y into a Numpy array containing the beginning value of every range (in increasing order), with N as the last value (assuming N is excluded from the ranges of y, or N+1 otherwise). This part can be considered free since y can be computed at compile time in your case. Here is an example:
```python
import numpy as np

y = np.array([1, 4, 8, 10, 13, ..., N])
```
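If y is still available as the original nested list at preprocessing time, this boundary array could be derived along these lines (a sketch; `y_lists` is my name for the nested list, and the last entry is one past the final value, matching the N+1 convention):

```python
import numpy as np

# Nested form of y from the question.
y_lists = [[1, 2, 3], [4, 5, 6, 7], [8, 9], [10, 11, 12]]

# First value of each range, plus one past the last value as the end sentinel.
y_bounds = np.array([s[0] for s in y_lists] + [y_lists[-1][-1] + 1])
print(y_bounds)  # [ 1  4  8 10 13]
```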
Then, you need to perform the binary search and check that the values fit in the range of y:
```python
indices = np.searchsorted(y, x, 'right')

# The `0 < indices < len(y)` check should not be needed regarding the input.
# If so, you can use only `indices -= 1`.
indices = indices[(0 < indices) & (indices < len(y))] - 1
```
Then you need to count the indices and filter the ones that appear at least 10 times:
```python
uniqueIndices, counts = np.unique(indices, return_counts=True)
result = uniqueIndices[counts >= 10]
```
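A possible alternative (not from the original answer, and worth benchmarking on real data): since the indices are small non-negative integers, np.bincount can replace np.unique for the counting step.

```python
# Assumes `indices` from the previous step, i.e. values in [0, len(y) - 2].
counts = np.bincount(indices, minlength=len(y) - 1)
result = np.nonzero(counts >= 10)[0]
```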
Here is an example based on your input:
```python
x = np.array([1, 2, 3, 4, 5, 6])

# [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]
y = np.array([1, 4, 5, 7, 8, 12])

# Actual simplified version of the above algorithm
indices = np.searchsorted(y, x, 'right') - 1
uniqueIndices, counts = np.unique(indices, return_counts=True)
result = uniqueIndices[counts >= 2]

# [0, 2]
print(result.tolist())
```
It runs in less than 0.1 ms on my machine on a random input based on your input constraints.
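A rough way to reproduce that kind of timing (a sketch under the sizes stated in the question, ~500,000 ranges of ~500 elements and 500 values in x; the wrapper name is mine and exact numbers will vary by machine):

```python
import timeit
import numpy as np

n_ranges, range_len = 500_000, 500
N = n_ranges * range_len

# Boundaries: the start of each range, plus N + 1 as the end sentinel.
y = np.arange(1, N + 2, range_len)
x = np.random.randint(1, N + 1, size=500)

def find_best_np(x, y, threshold=10):
    indices = np.searchsorted(y, x, 'right') - 1
    uniqueIndices, counts = np.unique(indices, return_counts=True)
    return uniqueIndices[counts >= threshold]

# Average time per call, in milliseconds.
runs = 1000
print(timeit.timeit(lambda: find_best_np(x, y), number=runs) / runs * 1000, "ms")
```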