
Am I parallelizing this right?

I basically have to obtain a certain probability distribution by sampling from a sample of 5000 matrices. So I just need to count how many times element X occurs in the position (i,j) of these 5000 matrices. Then, I save these [values and counts] in a dictionary.

That said, I thought it would be a good idea to parallelize my code, as serial code would run incredibly slowly. The code is the following:


Since it is my first time parallelizing a function, I would like some feedback. Also, as this is still slow due to the fact that it has to load a 1000×306 matrix, any advice on how to improve it would be very welcome.


Answer

Based on this description:

how many times element X occurs in the position (i,j) of these 5000 matrices

I would re-structure your code to return a 306×306 array of dictionaries, where each dictionary has a key for every value occurring in that position and a value for how many times it occurs. You can then generate the data for a subset of the files in parallel and merge the results at the end. You should adjust the chunksize to load many files at once (as many as you have RAM for) to reduce the number of times you have to loop manually over the array indices. Re-ordering the data into “Fortran” order should make array access more efficient (calls to np.unique will be faster) when accessing arr[:,i,j].
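As a rough sketch of that idea (not a drop-in replacement for your code), the count-then-merge structure could look something like the following. The function names (count_chunk, count_files, merge_counts, count_parallel) are made up for illustration, and it assumes each file is a single 2-D matrix saved with np.save; adapt the loading step to however your data is actually stored. It uses collections.Counter, which is a dict subclass, so that merging is just an update.

```python
from collections import Counter
from multiprocessing import Pool

import numpy as np


def count_chunk(arr):
    """Count values per position for a stack of shape (n, rows, cols)."""
    # Fortran order stores the first axis contiguously, so arr[:, i, j]
    # is a contiguous slice of memory.
    arr = np.asfortranarray(arr)
    _, rows, cols = arr.shape
    out = np.empty((rows, cols), dtype=object)
    for i in range(rows):
        for j in range(cols):
            values, counts = np.unique(arr[:, i, j], return_counts=True)
            out[i, j] = Counter(dict(zip(values.tolist(), counts.tolist())))
    return out


def count_files(paths):
    """Load one chunk of files and count it.

    Assumption: every path points to one 2-D matrix saved with np.save.
    """
    return count_chunk(np.stack([np.load(p) for p in paths]))


def merge_counts(parts):
    """Add the per-position Counters of all chunks into the first one."""
    parts = iter(parts)
    total = next(parts)
    for part in parts:
        for idx in np.ndindex(total.shape):
            total[idx].update(part[idx])  # Counter.update adds the counts
    return total


def count_parallel(paths, chunksize=100, processes=None):
    """Split paths into chunks, count each chunk in a worker, then merge."""
    chunks = [paths[k:k + chunksize] for k in range(0, len(paths), chunksize)]
    with Pool(processes) as pool:
        # The order of the chunks does not matter for a sum of counts.
        return merge_counts(pool.imap_unordered(count_files, chunks))
```

You would call count_parallel(list_of_paths, chunksize=...) from under an if __name__ == "__main__": guard, picking chunksize so that one chunk per worker still fits in RAM. The result is an object array where result[i, j] maps each value seen at position (i, j) to its count.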

User contributions licensed under: CC BY-SA