
Convert normal Python code to an MPI code

I have this code that I would like to edit and run as an MPI code. The array mass_array1 in the code is a multi-dimensional array with a total of about 80 million i*j 'iterations'; that is, if I flatten the array into a one-dimensional array, it has 80 million elements.

The code takes almost 2 days to run, which is quite annoying as it is only a small part of the whole project. Since I can log into a cluster and run the code on 20 or so processors (or even more), can someone help me convert this code to an MPI code?

Even an MPI version written in C would work.

#Allotting Black Holes at z=6
import numpy as np
from tqdm import tqdm

bhs=[0]*1000

for i in tqdm(range(0,1000),leave=True):
    bhs[i]=np.zeros(len(mass_array1[i]))
    for j in range(len(mass_array1[i])):
        bhs[i][j]=np.random.lognormal(np.log(Mbhthfit6(mass_array1[i],6)[j]),np.log(5))
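
For illustration, a minimal mpi4py sketch of the kind of split I am asking about could look like the following (this is only a sketch: it assumes mpi4py is installed, uses a random placeholder in place of the real mass_array1, and inlines Mbhthfit6 as defined further down). Each rank handles a strided share of the 1000 sub-arrays and rank 0 gathers the results:

import numpy as np
from mpi4py import MPI

def Mbhthfit6(Mdm, z):
    a = 5.00041824
    b = 0.31992748
    return (10**a)*(Mdm**b)

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # placeholder data; in practice this would be the real mass_array1
    mass_array1 = [np.random.uniform(1e7, 1e9, 800) for _ in range(1000)]
else:
    mass_array1 = None
mass_array1 = comm.bcast(mass_array1, root=0)

rng = np.random.default_rng(rank)  # independent random stream per rank
local = {}
for i in range(rank, len(mass_array1), size):
    local[i] = rng.lognormal(np.log(Mbhthfit6(mass_array1[i], 6)), np.log(5))

gathered = comm.gather(local, root=0)
if rank == 0:
    bhs = {}
    for part in gathered:
        bhs.update(part)
    print(len(bhs), "sub-arrays computed")

With the job script shown below, this would be launched with something like mpirun -np $NSLOTS python script.py instead of ./a.out.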

Current C program using MPI on that cluster:

int main(int argc, char **argv){
  float epsran;
  FILE *fp;
  char str[256];
  fp = fopen("parameterfile.dat","w");
  fprintf(fp,
          " cosmological parameter\n"
          "h:%f\n"
          "omegam:%f\n"
          "omegab:%f\n"
          "omegal:%f\n"
          "sigma8:%f\n"
          "rho0mMpc:%e\n"
          "alpha:%f\n"
          "deltac:%f\n", ndh,
          omegam, omegab, omegal, sigma8, rho0mMpc, alpha, deltac);
  fclose(fp);
  /* MPI test */
  int i, Petot, MyRank;
  clock_t start, end;
  start = clock();
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &Petot);
  MPI_Comm_rank(MPI_COMM_WORLD, &MyRank);
  srand((unsigned)(time(NULL) + MyRank));
  //printf("Hello World %d\n%d", MyRank, Petot);
  float samples[100];
  for(i = 0; i < 100/Petot; i++){
    samples[i] = halo_samples(1.68, 1000);
    outputS(235, 30, varipsapp(samples[i], 0), MyRank*(100/Petot) + i);
  }
  printf("Length:%d", (int)(sizeof(samples)/sizeof(samples[0])));
  /*  FILE *fpw;
  fpw = fopen("Minitial.dat","w");
  for(i = 0; i < MyRank*(100/Petot); i++){
    fprintf(fpw, "%f\n", samples[i]);
  }
  fclose(fpw); */
  MPI_Finalize();
  end = clock();
}

Submitting a job

After this, there is a job.sh file that looks something like this:

#!/bin/sh     
#$ -S /bin/sh                                                                  
#$ -cwd                                          
#$ -V
#$ -N mergertree
#$ -q all.q@messier04
#$ -q all.q@messier05
#$ -pe openmpi10 20 
#$ -o resultfile/out.txt
#$ -e resultfile/error.txt
                                                       
mpirun -np $NSLOTS ./a.out

Mbhthfit6

This is how I have defined Mbhthfit6 in my code:

def Mbhthfit6(Mdm,z):
    a= 5.00041824
    b= 0.31992748
    Mbhth=(10**a)*(Mdm**b)
    return Mbhth
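
Since this function only uses element-wise operations, it also works when Mdm is a whole NumPy array rather than a single number, for example (the masses below are just made-up values for illustration):

import numpy as np

masses = np.array([1.0e8, 5.0e8, 2.0e9])
print(Mbhthfit6(masses, 6))  # returns an array of the same shape as masses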

mass_array1

Here, I have uploaded one of the files (in zip format) that contains the data for mass_array1. https://drive.google.com/file/d/1C-G28OSND7jxqkFZQS3dlW6_40yBN6Fy/view?usp=sharing

You need to unzip the file into a folder and then use the code below to import it in Python

This is my code to import the file (it's only 3 MB):

#import all the files from the directory
import glob
import os
import re
import time
import numpy as np
from tqdm import tqdm

dirlist=["bh2e8"]

mass_array1=[0]*1000
#print(mass_array)
#read all the files
for i,X in enumerate(dirlist):
    exec('filelist=glob.glob("%s/test*.dat")'%(X))
    #exec("mass_array%s=[]"%X)
    initial_mass=[]
    for j,Y in tqdm(enumerate(filelist),position=0, leave=True, total=1000):
        Y=Y.replace(os.sep, '/')
        #Z=int(Y[10:13])
        Z=int(re.findall(r"\d+", Y)[2])
        #print(Z)
        mass_array1[Z]=[]
        #print('i=',Z,end="\r")
        exec("initial_partial=np.loadtxt('%s',max_rows=1)"%(Y))
        exec("initial_mass=np.append(initial_mass,initial_partial)")
        exec("mass_partial=np.loadtxt('%s',skiprows=1)"%(Y))
        mass_array1[Z]=np.append(mass_partial,mass_array1[Z])
        #mass_array1[Z]=mass_partial


Answer

I don't view this as a big enough set of data to require MPI, provided you take an efficient approach to processing the data.

As I mentioned in the comments, I find the best approach to processing large amounts of numerical data is first to use numpy vectorization, then to try numba jit compilation, and to use multi-core processing only as a last resort. In general that follows the order of easiest to hardest, and will also get you the most speed for the least work. In your case I think vectorization is truly the way to go, and while I was at it, I did some re-organization which isn't really necessary, but helped me to keep track of the data.

import numpy as np
from pathlib import Path
import re

dirlist=[r"C:\Users\aaron\Downloads\bh2e8"]
dirlist = [Path(d) for d in dirlist] #convert directory paths to pathlib.Path objects for ease of file system manipulation

initial_mass = {} #use a dictionary so we don't have to preallocate indices
mass_array = {} #use a dictionary so we don't have to preallocate indices

for dir_path in dirlist:
    for child in dir_path.iterdir():
        m = re.match(r".*?test(?P<index>\d+)\.dat$", str(child))
        if m: #if we match the end of the child path as a testxxx.dat file (not another directory or some other file type)
            file_index = int(m["index"])
            with child.open() as f:
                arr = [float(line) for line in f if line.strip()] #1d array of float numbers skipping any empty lines
            initial_mass[file_index] = arr[0]
            mass_array[file_index] = np.array(arr[1:])

I started off reading in the data in a slightly different way because I found it more natural to create a dictionary of arrays so the order they were created wouldn’t matter. The index of the file (number at the end of the file name) is used as the key of the dictionary, so it is easy to convert it back to a list if you want with something like: mass_array = list(mass_array[i] for i in range(1000))

Then, looking at the rest of your code, all the numpy functions you used can process an entire array of data at a time, much faster than one element at a time in your inner loop (j), so I simply removed the inner loop and re-wrote the body to use vectorization:


#Allotting Black Holes at z=6

bhs={} #use a dictionary to avoid the need for preallocation

for i, arr in mass_array.items(): #items in python3 iteritems in python2
    
    #inline Mbhthfit6 function, and calculate using vectorization (compute an entire array at once per iteration of `i`)
    bhs[i] = np.random.lognormal(
                                np.log((10**5.00041824)*(arr**0.31992748)),
                                np.log(5)
                                )

Again, if you want to convert the bhs dictionary back to a list like you previously had, it's quite simple: bhs = list(bhs[i] for i in range(1000))

With these changes (and a relatively powerful PC) the code executed on the data files you provided in under half a second. With just over 700,000 values in the example dataset, extrapolating out to 80 million, that should be on the order of a minute or two.
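
If the full 80-million-element dataset ever turns out to be too slow even after vectorization, the next step in the order above would be multi-core processing. A minimal sketch using the standard-library multiprocessing module (placeholder data stands in for the mass_array dictionary built above; each task creates its own random generator so the workers don't all repeat the same random numbers):

import numpy as np
from multiprocessing import Pool

def process_one(item):
    # compute the lognormal black-hole masses for one (index, array) pair
    i, arr = item
    rng = np.random.default_rng()  # fresh, independent random stream per task
    return i, rng.lognormal(np.log((10**5.00041824)*(arr**0.31992748)), np.log(5))

if __name__ == "__main__":
    # placeholder data standing in for the mass_array dictionary built above
    mass_array = {i: np.random.uniform(1e7, 1e9, 700) for i in range(1000)}

    with Pool() as pool:
        bhs = dict(pool.map(process_one, mass_array.items()))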

P.S. if you find yourself using exec a lot with generated strings of code, you'll almost always find there's a better way to accomplish the same thing, usually with just a slightly different data structure.
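
For instance, the string-built calls in the original file-reading loop each have a direct, exec-free equivalent (using the same variable names as in the question; the directory name here is just the example one):

import glob
import numpy as np

X = "bh2e8"  # directory name, as in the question's dirlist
filelist = glob.glob("%s/test*.dat" % X)  # instead of exec('filelist=glob.glob("%s/test*.dat")'%(X))

for Y in filelist:
    initial_partial = np.loadtxt(Y, max_rows=1)  # instead of exec("initial_partial=np.loadtxt('%s',max_rows=1)"%(Y))
    mass_partial = np.loadtxt(Y, skiprows=1)     # instead of exec("mass_partial=np.loadtxt('%s',skiprows=1)"%(Y))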

User contributions licensed under: CC BY-SA