
Accumulate sliding windows relative to origin

I have an array A with shape (3, 3) which can be thought of as the sliding-window view of an unknown array with shape (5,). I want to compute the inverse of this windowing operation. The adjoint of windowing is summation: I want to accumulate the values in each window into the corresponding positions of the shape-(5,) array. Of course, the expected output of this inverse function and the input A are not actually related; they are just ordinary arrays. I have two examples which I hope explain this better.

import numpy as np

A = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 1]], dtype=np.float32)

I expect this output:

np.array([0, 0, 1, 1, 1])

The other example:

A = np.array([[1, 2, 3],
              [2, 3, 4],
              [3, 4, 5]], dtype=np.float32)

I expect this output:

np.array([1, 2+2, 3+3+3, 4+4, 5]) = np.array([1, 4, 9, 8, 5])

The solution I have, which is quite slow (the result is stored in out):

out = np.zeros(5, dtype=np.float32)
# strides are given in bytes; 4 is the itemsize of float32, so each window starts one element later
windows = np.lib.stride_tricks.as_strided(out, shape=(3,3), strides=(4,4))
for i in np.ndindex(windows.shape):
  windows[i] += A[i]

Writing to a strided view feels a bit hacky and I am sure there is a better solution.

Is there any way to write this in a vectorized manner, without the for loop, that also generalizes to multiple dimensions?

EDIT

In terms of generalizing to higher dimensions, I have cases where the windows are taken from an image (a 2D array) instead of a 1D array as in the example above. For the 2D case, A can, for example, contain windows of size 3. This means that from an image (the output) with shape (4, 4), the windows A will have shape (2, 2, 3, 3).

A = np.array([[[[0, 0, 0],
                [0, 1, 0],
                [0, 0, 0]],

               [[0, 0, 0],
                [1, 0, 0],
                [0, 0, 0]]],


              [[[0, 1, 0],
                [0, 0, 0],
                [0, 0, 0]],

               [[1, 0, 0],
                [0, 0, 0],
                [0, 0, 0]]]], dtype=np.float32)

Using the solution given by Pablo, I get the following error:

value array of shape (2,2,3,3)  could not be broadcast to indexing result of shape (2,2)

Using a slightly modified version of my stride solution:

def inverse_sliding_windows(A, window_sz, image_sz):
  out = np.zeros(image_sz, dtype=np.float32)
  # sliding_window_view returns a read-only view by default; writeable=True lets us accumulate into out
  windows = np.lib.stride_tricks.sliding_window_view(out, window_sz, writeable=True)
  for i in np.ndindex(windows.shape):
    windows[i] += A[i]
  return out

window_sz = (3,3)
image_sz = (4,4)
inverse_sliding_windows(A, window_sz, image_sz)

Output:

array([[0., 0., 0., 0.],
       [0., 4., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]], dtype=float32)

To clarify, the window size and output shape are known beforehand; see inverse_sliding_windows.


Answer

As I mentioned in the comment, a vectorized solution doesn’t always guarantee a better running time. If your matrix is large, you might prefer a more efficient method. There are also several reasons why the matrix-rotation approach is slow (though intuitive); see the comments.
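For reference, here is a fully vectorized variant for the 1D case (my own sketch, not part of the original answer), using np.add.at with an index matrix that gives each window element's position in the output. Note that ufunc.at is unbuffered and typically slower than the simple row loop in solution below, which illustrates the point above.

import numpy as np

A = np.array([[1, 2, 3],
              [2, 3, 4],
              [3, 4, 5]], dtype=np.float32)

n = A.shape[0]
idx = np.arange(n)[:, None] + np.arange(n)[None, :]  # output position of each window element
out = np.zeros(2 * n - 1, dtype=A.dtype)
np.add.at(out, idx, A)  # unbuffered add: repeated indices accumulate correctly
print(out)              # [1. 4. 9. 8. 5.]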

Performance comparison:

Solution: Wall time: 61.6 ms
Rotation: Wall time: 3.32 s

Code (tested in a Jupyter notebook):

import numpy as np

def rotate45_and_sum(A):
    n = len(A) 
    x, y = np.meshgrid(np.arange(n), np.arange(n))  # at least doubled the running time
    xn, yn = x + y, n - x + y - 1   # generating xn and yn at least doubled the running time
    M = np.zeros((2*n -1, 2*n -1))  # at least slows down running time by a factor of 4
    M[xn,yn] = A[x,y] # very inefficient indexing strategy
    return M.sum(1)

def solution(A):
    n = A.shape[0]
    retval = np.zeros(2*n-1)
    for i in range(n):
        retval[i:(i+n)] += A[i, :]
    return retval

A = np.random.randn(10000, 10000)

%time solution(A)

%time rotate45_and_sum(A)
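
A quick sanity check (my addition, not part of the original answer): both functions reproduce the expected output for the second 1D example from the question.

A_small = np.array([[1, 2, 3],
                    [2, 3, 4],
                    [3, 4, 5]], dtype=np.float32)

print(solution(A_small))           # [1. 4. 9. 8. 5.]
print(rotate45_and_sum(A_small))   # [1. 4. 9. 8. 5.]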

In the multidimensional situation:

def solution(A):
    h, w, x, y = A.shape                        # (h, w) window positions, (x, y) window size
    retval = np.zeros((h + x - 1, w + y - 1))   # shape of the reconstructed image
    indices = np.ndindex(h, w)                  # iterate over every window position
    for index in indices:
        slices = tuple()
        for i, sz in enumerate((x, y)):
            slices = slices + (slice(index[i], index[i] + sz),)
        retval[slices] += A[index]              # roughly `retval[i:(i+x), j:(j+y)] += A[i, j]` in your code
    return retval

Actually, I am not sure how the dimensions (or shapes) are calculated from your description, but I think it can be generalized. The idea is to construct the slices as you go: you need to specify which dimensions of A correspond to the window positions (h, w) and which correspond to the window size (x, y). I think it’s not difficult to do that.

Reference: Numpy index array of unknown dimensions?
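
As a quick check (my addition), running this generalized solution on the (2, 2, 3, 3) array A from the question’s edit reproduces the expected (4, 4) output:

out = solution(A)   # A is the (2, 2, 3, 3) array from the question's edit
print(out)
# [[0. 0. 0. 0.]
#  [0. 4. 0. 0.]
#  [0. 0. 0. 0.]
#  [0. 0. 0. 0.]]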


Regarding https://stackoverflow.com/a/67341994/14923227

def fast(A):
    n = A.shape[0]
    retval = np.zeros(2*n-1)
    for i in range(n):
        retval[i:(i+n)] += A[i, :]
    print(retval.sum())
    return retval

##########################
import threading

class sumThread(threading.Thread):
    def __init__(self, A, mat, threadID, ngroups, size):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.size = size
        self.ngroups = ngroups
        self.mat = mat
        self.A = A
    def run(self):
        # Each thread accumulates its block of rows into its own row of mat, so no locking
        # is needed; NumPy's += releases the GIL on large arrays, giving real parallelism.
        begin = (self.size + self.ngroups) // self.ngroups * self.threadID
        end   = min(self.size, (self.size + self.ngroups) // self.ngroups * (self.threadID + 1))
        for i in range(begin, end):
            self.mat[self.threadID, i:(i+self.size)] += self.A[i, :]

def faster(A):
    
    num_threads = max(1, A.shape[0] // 4000) 
    mat = np.zeros((num_threads, 2*A.shape[0]-1))
    threads = []
    for i in range(num_threads):
        t = sumThread(A, mat, i, num_threads, A.shape[0])
        t.start()
        threads.append(t)

    # Wait for all threads to complete
    for t in threads:
        t.join()
    return np.sum(mat, axis=0)
    

Performance for a large array:

A = np.random.randn(20000,20000)
%timeit fast(A)   # 263 ms ± 5.21 ms per loop 
%timeit faster(A) # 155 ms ± 3.14 ms per loop
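
A quick equivalence check (my addition): the threaded version produces the same result as fast.

print(np.allclose(fast(A), faster(A)))   # True, both compute the same accumulation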


It’s trivial to parallelize the for loop in fast. But fast is actually the most cache-efficient version (even with respect to GPU caches and memory banks) and thus the fastest serial way to compute it. Ideally, you could parallelize the code with CUDA/OpenCL, since a GPU has far more cores. Done correctly, the running time shrinks roughly in proportion to the number of cores k, with only a logarithmic number of steps needed for the final reduction.

However, there are only a few arithmetic operations per element in this function, so transferring the data between host memory and GPU memory (VRAM) might dominate the running time. (I didn’t test it.)
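
As a middle ground between the threaded version and a full CUDA/OpenCL port, here is a sketch (my addition, assuming Numba is installed) that parallelizes over output positions instead of rows, so no two threads ever write to the same element:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def fast_numba(A):
    n = A.shape[0]
    retval = np.zeros(2 * n - 1)
    # Each output position j sums A[i, j - i] over the windows that cover it,
    # so the iterations of the outer loop are independent and can run in parallel.
    for j in prange(2 * n - 1):
        lo = max(0, j - n + 1)
        hi = min(n, j + 1)
        s = 0.0
        for i in range(lo, hi):
            s += A[i, j - i]
        retval[j] = s
    return retval

The inner loop walks an anti-diagonal of A, which is less cache-friendly than the row-wise loop in fast, but each output element is owned by exactly one thread, so no per-thread buffer or final reduction is needed.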
