I am trying to improve the performance of some Python code. In that code, one column of a matrix (a NumPy array) has to be changed temporarily.
The given code looks as follows:
```python
import numpy as np

def get_Ai_copy(A, b, i):
    # Copy the whole matrix, then overwrite column i with b
    Ai = A.copy()
    Ai[:, i] = b[:, 0]
    return Ai
```
Now I thought it would be a big improvement not to create a copy of the entire matrix `A` (in the example used, the matrix is 500×500 with all entries strictly greater than 0), and instead just use `np.column_stack()` to build a new temporary matrix out of the columns I need, like this:
```python
def get_Ai(A, b, i):
    # Stack the columns left of i, the new column b, and the columns right of i
    return np.column_stack([A[:, :i], b, A[:, i+1:]])
```
I would have expected this to give a big performance increase, but it turns out it is actually slower than the given method. I ran both versions 100 times and compared the average runtimes:
```python
import time

# A (500x500) and b (500x1) are set up earlier in the notebook
number_tests = 100
copy_times = np.empty(number_tests)
stacking_times = np.empty(number_tests)

for j in range(number_tests):
    t0 = time.time()
    for i in range(500):
        Ai = get_Ai_copy(A, b, i)
    t1 = time.time()
    copy_times[j] = t1 - t0
    # print(f'-- Run # {j}: CPU time for copying Ai = %g seconds' % (t1 - t0))

    t0 = time.time()
    for i in range(500):
        Ai = get_Ai(A, b, i)
    t1 = time.time()
    stacking_times[j] = t1 - t0
    # print(f'-- Run # {j}: CPU time for column stacking Ai = %g seconds' % (t1 - t0))
    # print()

print(f'Copying times average: {np.mean(copy_times)}')
print(f'Stacking times average: {np.mean(stacking_times)}')
```
And the result was:
```
Copying times average: 0.19957998037338257
Stacking times average: 0.22774386405944824
```
I don’t understand why that is the case.
Is there some explanation that I'm not seeing? Is it maybe more cache-friendly to copy the entire array than to take the three slices? If so, does anyone know why?
Side info: I'm running this in a Jupyter notebook, on a laptop with an Intel i7-10750H (12 MB cache) and 32 GB RAM.
`A` is always a nonsingular matrix, if that matters.
Answer
The `copy` method of NumPy arrays triggers native code that simply copies all the array data over at maximum CPU speed. At 500 × 500 elements of 8 bytes each, we are talking about ~2 MB of data, which fits comfortably even in the cache of your CPU. And NumPy only has to create the metadata for a single Python object.
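To get a feel for the sizes involved, here is a quick check (a minimal sketch; the random matrix just stands in for your actual `A`):

```python
import numpy as np

A = np.random.rand(500, 500)   # float64: 8 bytes per element
print(A.nbytes)                # 2_000_000 bytes, i.e. ~2 MB
                               # -> fits easily in a 12 MB CPU cache
```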
On the other hand, `column_stack` runs some Python code (although not on fine-grained objects, or it would be worse), and ends up copying the array data anyway: it takes your slices of the current array (a slice by itself is not copied), but then calls `np.concatenate` internally, which triggers the copy. So you just add the overhead of copying the data in parts, plus some juggling to create on the order of 10 Python-level array objects along the way (between slicing, concatenating, etc.), and that accounts for the ~10% extra time you measured.