
Convert a normal Python script to an MPI program

I have some code that I would like to edit and run as an MPI program. The array mass_array1 in the code is multi-dimensional, with a total of about 80 million iterations (i*j). In other words, if I flatten the array into a one-dimensional array, it has about 80 million elements.

The code takes almost 2 days to run, which is quite annoying since it is only a small part of the whole project. Since I can log into a cluster and run the code across 20 or so processors (or even more), can someone help me convert this code to an MPI code?

Writing the MPI code in C would also work for me.


Current C program using MPI on that cluster:


Submitting a job

After this, there is a job.sh file that looks something like this:


Mbhfit6

This is how I have defined Mbhfit6 in my code:


mass_array1

Here, I have uploaded one of the files (in zip format) containing the data for mass_array1: https://drive.google.com/file/d/1C-G28OSND7jxqkFZQS3dlW6_40yBN6Fy/view?usp=sharing

You need to unzip the file into a folder and then use the code below to import it into Python.

This is my code to import the file (it's only 3 MB):



Answer

I don’t view this as a big enough set of data to require MPI, provided you take an efficient approach to processing it.

As I mentioned in the comments, I find the best approach to processing large amounts of numerical data is first to use NumPy vectorization, then to try Numba JIT compilation, and only then to use multi-core processing as a last resort. In general that order goes from easiest to hardest, and it will also get you the most speed for the least work. In your case I think vectorization is truly the way to go, and while I was at it, I did some re-organization which isn’t strictly necessary, but helped me keep track of the data.
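To illustrate why vectorization comes first: a toy comparison (the formula below is made up for demonstration, not the asker's actual computation) of applying a NumPy function one element at a time versus to the whole array at once.

```python
import time
import numpy as np

# One million values as a stand-in for a slice of mass_array1 (illustrative only).
data = np.random.default_rng(0).uniform(1.0, 10.0, 1_000_000)

t0 = time.perf_counter()
loop_result = sum(np.log10(x) for x in data)   # scalar call per element: slow
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
vec_result = np.log10(data).sum()              # one whole-array call: fast
t_vec = time.perf_counter() - t0

assert np.isclose(loop_result, vec_result)     # identical result, far less time
```

The vectorized version typically runs two orders of magnitude faster, because the per-element loop pays Python and NumPy dispatch overhead a million times instead of once.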


I started off reading in the data in a slightly different way, because I found it more natural to create a dictionary of arrays so that the order in which they were created wouldn’t matter. The index of each file (the number at the end of the file name) is used as the dictionary key, so it is easy to convert back to a list if you want, with something like: mass_array = list(mass_array[i] for i in range(1000))
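A minimal sketch of that reading pattern, assuming file names ending in an index number (the names, extension, and text format below are assumptions for illustration, not the asker's actual layout):

```python
import os
import re
import tempfile
import numpy as np

def load_mass_arrays(folder):
    """Read every data file in `folder` into a dict keyed by the integer
    at the end of the file name, so read order never matters."""
    mass_array = {}
    for fname in os.listdir(folder):
        m = re.search(r'(\d+)(?:\.\w+)?$', fname)  # number before optional extension
        if m:
            mass_array[int(m.group(1))] = np.loadtxt(os.path.join(folder, fname))
    return mass_array

# Tiny demo with made-up files (names and contents are hypothetical):
with tempfile.TemporaryDirectory() as d:
    for i in range(3):
        np.savetxt(os.path.join(d, f"mass_array1_{i}.txt"), np.arange(4) * (i + 1))
    mass_array = load_mass_arrays(d)
    as_list = [mass_array[i] for i in range(3)]  # convert back to a list if desired
```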

Then looking at the rest of your code, all the numpy functions you used are able to process an entire array of data at a time much faster than one at a time using your inner loop (j), so I simply removed the inner loop, and re-wrote the body to use vectorization:


Again, if you want to convert the bhs dictionary back to a list like you previously had, it’s quite simple: bhs = list(bhs[i] for i in range(1000))

With these changes (and a relatively powerful PC), the code executed on the data files you provided in under half a second. With just over 700,000 values in the example dataset, extrapolating out to 80 million suggests a runtime on the order of a minute or two.

P.S. If you find yourself using exec a lot with generated strings of code, you’ll almost always find there’s a better way to accomplish the same thing, usually with just a slightly different data structure.
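For example (the variable names and loaded values here are hypothetical, purely to show the pattern): instead of exec-ing generated variable names, key a dict by the same index.

```python
import numpy as np

# Anti-pattern: generating variable names with exec, e.g.
#   exec(f"mass_array1_{i} = np.loadtxt(f'mass_array1_{i}.txt')")
# which makes every later access need another exec/eval.

# Better: the same structure as a dict keyed by i.
arrays = {}
for i in range(3):
    arrays[i] = np.arange(i + 1)   # stand-in for np.loadtxt(...)

# Access is now an ordinary expression:
total = sum(a.sum() for a in arrays.values())
```

Every place you would have built a string of code, you now just index the dict, which is faster, safer, and easier to debug.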

User contributions licensed under: CC BY-SA