How to implement multiprocessing in my python scri…

What I’m trying to do is have multiple cores share the processing work.

I have a Python script that builds feature vectors for my ML and DL models. It takes a txt file and compares it against a list of unique values that is initialized at runtime: for each line in the file, it finds that line’s position in the unique list and adds 1 at the same position in a second list. That list is then appended to another list, which is written to a CSV file at the end of execution. Without threading, one file takes roughly 500 ms to 1.5 s to process, and the activity monitor shows a single core pinned at 100% at any given time. With multithreading, all 10 cores sit at only about 10% to 20%, and execution is not much faster than the non-threaded version.

I have 6464 files, and my system is either an i5 8300H (4 cores) or an i9 10900 (10 cores).

In find_files(), the script locates the files that I need. Some of the variables are initialized elsewhere; I left that out to avoid making the code complicated. At the end, you can see a call to a thread helper function, which distributes the calls across multiple threads.

def find_files():
    internal_count = 0

    output_path = script_path + "/output"
    fam_dirs = os.listdir(output_path)
    for fam_dir in fam_dirs:
        if fam_dir in malware_families:
            files = os.listdir(output_path + "/" + fam_dir + "/")
            for file in files:
                file_path = output_path + "/" + fam_dir + "/" + file
                internal_count += 1
                thread_helper(file_path, file, fam_dir, internal_count)

Here the thread helper takes the ‘count’ argument and uses it to assign each call to a specific thread, to even out the load. I’ve also tried thread.join(), but the issue is the same.

def thread_helper(file_path, file, fam_dir, count):
    if count % 4 == 0:
        t4 = threading.Thread(target=generate_feature_vector, args=(file_path, file, fam_dir))
        t4.start()
    elif count % 3 == 0:
        t3 = threading.Thread(target=generate_feature_vector, args=(file_path, file, fam_dir))
        t3.start()
    if count % 2 == 0:
        t2 = threading.Thread(target=generate_feature_vector, args=(file_path, file, fam_dir))
        t2.start()
    else:
        t1 = threading.Thread(target=generate_feature_vector, args=(file_path, file, fam_dir))
        t1.start()


# wait until threads are completely executed
# t1.join()
# t2.join()
# t3.join()
# t4.join()

In this function I open a file, read its lines one by one, find each line’s position in the unique list, and add 1 to the feature vector list at that same position.

def generate_feature_vector(path, file_name, family):
    try:
        with open(path, 'r') as file:
            lines = file.read()
            lines = lines.split("\n")

            feature_vector = list()

            # Init feature_vector to 0 with the length of the unique list of that feature
            for i in range(len(feature_ls)):
                feature_vector.append([0] * len(unique_ls[i]))

            # +1 the value in the cell corresponding to the file_row and feature_name_column and append family name
            for line in lines:
                for i in range(len(feature_ls)):
                    if line in unique_ls[i]:
                        feature_vector[i][unique_ls[i].index(line)] += 1
                    feature_vector[i][len(feature_vector[i]) - 1] = family

I’m not sure where I went wrong. Can you help me divide the load across multiple CPU cores? Also, please correct me if I have misused any terms in this question.

Answer

Because of Python’s GIL (global interpreter lock), only one thread executes Python bytecode at a time. Threading therefore helps only with I/O-bound applications, not CPU-bound ones like yours.

To parallelize your program, I recommend using the multiprocessing module. Its API is quite similar to threading, but it runs the work in separate processes, each with its own interpreter and its own GIL, so it achieves true parallelism.
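As a rough illustration of that suggestion, here is a minimal sketch of how the file loop could be handed to a process pool. It is not a drop-in replacement for the script above: UNIQUE_LS is a hypothetical stand-in for the question’s unique_ls, the "output" path and family names in the main block are placeholders, and find_tasks and run_parallel are names invented for this example. It also swaps list.index() for a precomputed dictionary lookup and returns the family name next to the vector instead of writing it into the last cell. Worker processes do not share lists with the parent, so each worker returns its result and the parent collects them.

```python
import os
from multiprocessing import Pool

# Hypothetical stand-in for the question's unique_ls (one list per feature).
UNIQUE_LS = [["open", "read", "write"], ["connect", "send"]]

# Map each known line to its column once, so lookups avoid repeated list.index() scans.
INDEX_MAPS = [{value: pos for pos, value in enumerate(values)} for values in UNIQUE_LS]


def generate_feature_vector(task):
    """Worker: count how often each known line occurs in one file."""
    path, family = task
    with open(path, "r") as handle:
        lines = handle.read().split("\n")
    vector = [[0] * len(values) for values in UNIQUE_LS]
    for line in lines:
        for i, index_map in enumerate(INDEX_MAPS):
            pos = index_map.get(line)
            if pos is not None:
                vector[i][pos] += 1
    # Return the result: the parent process cannot see lists mutated in a worker.
    return family, vector


def find_tasks(output_path, malware_families):
    """Collect (file_path, family) pairs instead of starting a thread per file."""
    tasks = []
    for fam_dir in sorted(os.listdir(output_path)):
        if fam_dir in malware_families:
            fam_path = os.path.join(output_path, fam_dir)
            for name in sorted(os.listdir(fam_path)):
                tasks.append((os.path.join(fam_path, name), fam_dir))
    return tasks


def run_parallel(tasks, processes=None):
    """Spread the tasks over worker processes (default: one per CPU core)."""
    with Pool(processes=processes) as pool:
        # chunksize batches several files per dispatch to reduce overhead.
        return pool.map(generate_feature_vector, tasks, chunksize=16)


if __name__ == "__main__":
    output_path = "output"  # placeholder location, adjust to your layout
    if os.path.isdir(output_path):
        results = run_parallel(find_tasks(output_path, {"family_a", "family_b"}))
        print(len(results), "files processed")
```

The `if __name__ == "__main__":` guard matters here: on platforms where worker processes are started by re-importing the main module, code outside the guard would run again in every worker. Since pool.map returns results in the same order as the task list, the rows can be written to the CSV file in the parent after the pool has finished.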
