
Accumulate sliding windows relative to origin

I have an array A with shape (3, 3) which can be thought of as the sliding-window view of an unknown array with shape (5,). I want to compute the inverse of this windowing operation. The adjoint of windowing is summation: I want to accumulate the values in each window into the corresponding positions of the shape-(5,) array. Of course, the expected output of this inverse function and the input A are not actually related; they are just ordinary arrays. I have two examples which I hope explain this better.

import numpy as np

A = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 1]], dtype=np.float32)

I expect this output:

np.array([0, 0, 1, 1, 1])

The other example:

A = np.array([[1, 2, 3],
              [2, 3, 4],
              [3, 4, 5]], dtype=np.float32)

I expect this output:

np.array([1, 2+2, 3+3+3, 4+4, 5]) = np.array([1, 4, 9, 8, 5])

The solution I have, which is quite slow (the result is stored in out):

out = np.zeros(5, dtype=np.float32)
# strides are given in bytes; 4 is the itemsize of float32, so each window starts one element later
windows = np.lib.stride_tricks.as_strided(out, shape=(3,3), strides=(4,4))
for i in np.ndindex(windows.shape):
  windows[i] += A[i]

Writing to a strided view feels a bit hacky and I am sure there is a better solution.

Is there any way to write this in a vectorized manner, without the for loop, that also generalizes to multiple dimensions?

EDIT

In terms of generalizing to higher dimensions, I have cases where the windows are taken from an image (a 2D array) instead of a 1D array as in the example above. For the 2D case, A can, for example, contain windows of size 3. This means that from an image (the output) with shape (4, 4), the windows A will have shape (2, 2, 3, 3).

A = np.array([[[[0, 0, 0],
                [0, 1, 0],
                [0, 0, 0]],

               [[0, 0, 0],
                [1, 0, 0],
                [0, 0, 0]]],


              [[[0, 1, 0],
                [0, 0, 0],
                [0, 0, 0]],

               [[1, 0, 0],
                [0, 0, 0],
                [0, 0, 0]]]], dtype=np.float32)

Using the solution given by Pablo, I get the following error:

value array of shape (2,2,3,3)  could not be broadcast to indexing result of shape (2,2)

Using a slightly modified version of my stride solution:

def inverse_sliding_windows(A, window_sz, image_sz):
  out = np.zeros(image_sz, dtype=np.float32)
  # sliding_window_view returns a read-only view by default; writeable=True lets us accumulate into out
  windows = np.lib.stride_tricks.sliding_window_view(out, window_sz, writeable=True)
  for i in np.ndindex(windows.shape):
    windows[i] += A[i]
  return out

window_sz = (3,3)
image_sz = (4,4)
inverse_sliding_windows(A, window_sz, image_sz)

Output:

array([[0., 0., 0., 0.],
       [0., 4., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]], dtype=float32)

To clarify, the window size and output shape are known beforehand; see inverse_sliding_windows.


Answer

As I mentioned in the comment, a vectorized solution doesn’t always guarantee a better running time. If your matrix is large, you might prefer a more efficient method. There are also several reasons why the matrix-rotation approach is slow (though intuitive); see the comments.
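For reference, here is a fully vectorized variant for the 1D case (my own sketch, not part of the original answer), using np.add.at with an index matrix that gives each window element's position in the output. Note that ufunc.at is unbuffered and typically slower than the simple row loop in solution below, which illustrates the point above.

import numpy as np

A = np.array([[1, 2, 3],
              [2, 3, 4],
              [3, 4, 5]], dtype=np.float32)

n = A.shape[0]
idx = np.arange(n)[:, None] + np.arange(n)[None, :]  # output position of each window element
out = np.zeros(2 * n - 1, dtype=A.dtype)
np.add.at(out, idx, A)  # unbuffered add: repeated indices accumulate correctly
print(out)              # [1. 4. 9. 8. 5.]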

Performance comparison:

Solution: Wall time: 61.6 ms
Rotation: Wall time: 3.32 s

Code (tested in a Jupyter notebook):

import numpy as np

def rotate45_and_sum(A):
    n = len(A) 
    x, y = np.meshgrid(np.arange(n), np.arange(n))  # at least doubled the running time
    xn, yn = x + y, n - x + y - 1   # generating xn and yn at least doubled the running time
    M = np.zeros((2*n -1, 2*n -1))  # at least slows down running time by a factor of 4
    M[xn,yn] = A[x,y] # very inefficient indexing strategy
    return M.sum(1)

def solution(A):
    n = A.shape[0]
    retval = np.zeros(2*n-1)
    for i in range(n):
        retval[i:(i+n)] += A[i, :]
    return retval

A = np.random.randn(10000, 10000)

%time solution(A)

%time rotate45_and_sum(A)
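
A quick sanity check (my addition, not part of the original answer): both functions reproduce the expected output for the second 1D example from the question.

A_small = np.array([[1, 2, 3],
                    [2, 3, 4],
                    [3, 4, 5]], dtype=np.float32)

print(solution(A_small))           # [1. 4. 9. 8. 5.]
print(rotate45_and_sum(A_small))   # [1. 4. 9. 8. 5.]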

In the multidimensional situation:

def solution(A):
    h, w, x, y = A.shape                        # (h, w) window positions, (x, y) window size
    retval = np.zeros((h + x - 1, w + y - 1))   # shape of the reconstructed image
    indices = np.ndindex(h, w)                  # iterate over every window position
    for index in indices:
        slices = tuple()
        for i, sz in enumerate((x, y)):
            slices = slices + (slice(index[i], index[i] + sz),)
        retval[slices] += A[index]              # roughly `retval[i:(i+x), j:(j+y)] += A[i, j]` in your code
    return retval

Actually, I am not sure how the dimensions (or shapes) are calculated from your description, but I think it can be generalized. The idea is to construct the slices as you go: you need to specify which dimensions of A correspond to the window positions (h, w) and which correspond to the window size (x, y). I think it’s not difficult to do that.

Reference: Numpy index array of unknown dimensions?
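
As a quick check (my addition), running this generalized solution on the (2, 2, 3, 3) array A from the question’s edit reproduces the expected (4, 4) output:

out = solution(A)   # A is the (2, 2, 3, 3) array from the question's edit
print(out)
# [[0. 0. 0. 0.]
#  [0. 4. 0. 0.]
#  [0. 0. 0. 0.]
#  [0. 0. 0. 0.]]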


Regarding https://stackoverflow.com/a/67341994/14923227

def fast(A):
    n = A.shape[0]
    retval = np.zeros(2*n-1)
    for i in range(n):
        retval[i:(i+n)] += A[i, :]
    print(retval.sum())
    return retval

##########################
import threading

class sumThread(threading.Thread):
    def __init__(self, A, mat, threadID, ngroups, size):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.size = size
        self.ngroups = ngroups
        self.mat = mat
        self.A = A
    def run(self):
        # Each thread accumulates its block of rows into its own row of mat, so no locking
        # is needed; NumPy's += releases the GIL on large arrays, giving real parallelism.
        begin = (self.size + self.ngroups) // self.ngroups * self.threadID
        end   = min(self.size, (self.size + self.ngroups) // self.ngroups * (self.threadID + 1))
        for i in range(begin, end):
            self.mat[self.threadID, i:(i+self.size)] += self.A[i, :]

def faster(A):
    
    num_threads = max(1, A.shape[0] // 4000) 
    mat = np.zeros((num_threads, 2*A.shape[0]-1))
    threads = []
    for i in range(num_threads):
        t = sumThread(A, mat, i, num_threads, A.shape[0])
        t.start()
        threads.append(t)

    # Wait for all threads to complete
    for t in threads:
        t.join()
    return np.sum(mat, axis=0)
    

Performance for a large array:

A = np.random.randn(20000,20000)
%timeit fast(A)   # 263 ms ± 5.21 ms per loop 
%timeit faster(A) # 155 ms ± 3.14 ms per loop
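
A quick equivalence check (my addition): the threaded version produces the same result as fast.

print(np.allclose(fast(A), faster(A)))   # True, both compute the same accumulation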


It’s trivial to parallelize the for loop in fast. But fast is actually the most cache-efficient version (even with respect to GPU caches and memory banks) and thus the fastest serial way to compute it. Ideally, you could parallelize the code with CUDA/OpenCL, since a GPU has far more cores. Done correctly, the running time shrinks roughly in proportion to the number of cores k, with only a logarithmic number of steps needed for the final reduction.

However, there are only a few arithmetic operations per element in this function, so transferring the data between host memory and GPU memory (VRAM) might dominate the running time. (I didn’t test it.)
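
As a middle ground between the threaded version and a full CUDA/OpenCL port, here is a sketch (my addition, assuming Numba is installed) that parallelizes over output positions instead of rows, so no two threads ever write to the same element:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def fast_numba(A):
    n = A.shape[0]
    retval = np.zeros(2 * n - 1)
    # Each output position j sums A[i, j - i] over the windows that cover it,
    # so the iterations of the outer loop are independent and can run in parallel.
    for j in prange(2 * n - 1):
        lo = max(0, j - n + 1)
        hi = min(n, j + 1)
        s = 0.0
        for i in range(lo, hi):
            s += A[i, j - i]
        retval[j] = s
    return retval

The inner loop walks an anti-diagonal of A, which is less cache-friendly than the row-wise loop in fast, but each output element is owned by exactly one thread, so no per-thread buffer or final reduction is needed.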
