Length of the intersections between a list an list of list

Question

Note : almost duplicate of Numpy vectorization: Find intersection between list and list of lists Differences : I am focused on efficiently when the lists are large I&#8217;m searching for the largest intersections. Here are some assumptions: y is a list of ~500,000 sublist of ~500 elements each sublist in y i…

Accepted Answer

Since y contains disjoint ranges and the union of them is also a range, a very fast solution is to first perform a binary search on y and then count the resulting indices and only return the ones that appear at least 10 times. The complexity of this algorithm is O(Nx log Ny) with Nx and Ny the number of items in respectively x and y. This algorithm is nearly optimal (since x needs to be read entirely).Actual implementationFirst of all, you need to transform your current y to a Numpy array containing the beginning value of all ranges (in an increasing order) with N as the last value (assuming N is excluded for the ranges of y, or N+1 otherwise). This part can be assumed as free since y can be computed at compile time in your case. Here is an example:import numpy as npy = np.array([1, 4, 8, 10, 13, ..., N])Then, you need to perform the binary search and check that the values fits in the range of y:indices = np.searchsorted(y, x, 'right')# The `0 < indices < len(y)` check should not be needed regarding the input.# If so, you can use only `indices -= 1`.indices = indices[(0 < indices) & (indices < len(y))] - 1Then you need to count the indices and filter the ones with at least :uniqueIndices, counts = np.unique(indices, return_counts=True)result = uniqueIndices[counts >= 10]Here is an example based on your:x = np.array([1, 2, 3, 4, 5, 6])# [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]y = np.array([1, 4, 5, 7, 8, 12])# Actual simplified version of the above algorithmindices = np.searchsorted(y, x, 'right') - 1uniqueIndices, counts = np.unique(indices, return_counts=True)result = uniqueIndices[counts >= 2]# [0, 2]print(result.tolist())It runs in less than 0.1 ms on my machine on a random input based on your input constraints.

Advertisement

Answer

Actual implementation