
Is splitting my program into 8 separate processes the best approach (performance-wise) when I have 8 logical cores in Python?

Intro

I have rewritten one of my previously sequential algorithms to run in parallel (real parallelism with separate processes, not concurrency or threads). A batch script launches my “worker” Python nodes, and each one performs the same task on a different offset, with no data shared between processes. As a dummy example, imagine an API endpoint that returns a number on a [GET] request and I have to guess whether it is even or odd, so I run my workers against it. I can’t share the real algorithm, but assume the routine for a single process is already optimized as far as it can go.
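To make the setup concrete, here is a minimal sketch of the launch pattern; `worker.py` and its `--offset`/`--step` arguments are placeholders for illustration, not my actual files:

```python
# launch_workers.py -- minimal sketch of the pattern described above.
# Assumes a hypothetical worker.py that accepts --offset and --step,
# so each process works on its own disjoint slice of the inputs.
import subprocess
import sys

NUM_WORKERS = 8  # one worker per logical core

procs = [
    subprocess.Popen(
        [sys.executable, "worker.py", "--offset", str(i), "--step", str(NUM_WORKERS)]
    )
    for i in range(NUM_WORKERS)
]

for p in procs:
    p.wait()
```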

Important: the processes are executed on Windows 10 with admin privileges and real-time priority.
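For illustration, one way to set that priority from inside each worker is sketched below (assuming psutil is installed; this is not necessarily how my batch script does it):

```python
# Sketch: bump the current process to realtime priority on Windows.
# REALTIME_PRIORITY_CLASS is a Windows-only psutil constant and
# requires the process to be running elevated (admin).
import psutil

psutil.Process().nice(psutil.REALTIME_PRIORITY_CLASS)
```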

Diagnostics

Is the optimal number of worker processes equal to the number of logical cores (i.e. 8)? In Task Manager I see my CPU hit 100% on all cores, but each individual process shows only about 6% CPU. With 6% × 8 = 48%, how does that add up? At idle (without the workers) my CPU sits at about 0–5% total.

I’ve tried to diagnose it with Performance Monitor but the results were even more confusing:

[Performance Monitor screenshot: per-process total CPU time over the run, with a bold blue line for the overall average across all cores]

[Performance Monitor screenshot: bar chart of CPU usage per worker process]

Reasoning: I didn’t know how to configure Performance Monitor to track my processes across separate cores, so I used total CPU time as the Y-axis. How can each of the 8 processes show a minimum of about 20% usage, which would mean at least 160% utilization?

Question 1

This doesn’t make much sense, and the numbers differ from what Task Manager shows. Worst of all is the bold blue line, which shows the total (average) CPU usage across all cores: it never seems to exceed 70%, while Task Manager says all my cores are running at 100%. What am I confusing here?

Question 2

Is running X processes, where X is the number of logical cores on the system, under real-time priority the best I can do (and just let the OS handle the scheduling)? Judging from the bar chart in the second picture, the OS is doing a decent job: ideally all the bars would be of equal height, and that is roughly true.


Answer

I have found the answer to this question and have decided to post it rather than delete the question. I used the psutil library to set the affinity of each worker process manually and distribute the workers across cores myself instead of leaving it to the OS. I also had MANY I/O operations on the network and from debug prints, which kept my processor cores from maxing out at 100% (after all, Windows is not a real-time operating system).
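For reference, the pinning looks roughly like this (a sketch assuming psutil; the WORKER_INDEX environment variable is a placeholder for however your launcher tells each worker which slot it is):

```python
# Sketch: each worker pins itself to a single logical core via psutil.
import os
import psutil

# Hypothetical: the launcher passes each worker its index (0..7).
worker_index = int(os.environ.get("WORKER_INDEX", "0"))

# Restrict this process to one logical core instead of letting the
# OS scheduler move it around.
core = worker_index % psutil.cpu_count(logical=True)
psutil.Process().cpu_affinity([core])
```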

In addition, since I tested the code on my laptop, I ran into thermal throttling, which distorted the %-usage readings.
