I’ve noticed that builds are distributed between workers sub-optimally: about 80% of the time, builds run on workers that are already busy.
If you look at the image, `tmp_worker1` could process `triggered_build_1`, but instead it is idle! For some reason, `triggered_build_1` is stuck acquiring locks and is assigned to the busy `example-worker`.
My setup is:
- 3 workers
- 1 main builder
- 3 triggerable builders (with locks)
The main part of the configuration is below:
```python
from buildbot.plugins import schedulers, steps, util

# triggerable scheduler
c['schedulers'].append(schedulers.Triggerable(
    name="trigger_from_main",
    builderNames=['triggered_build_0', 'triggered_build_1', 'triggered_build_2']))

# main builder factory
factory_main = util.BuildFactory()
# trigger
factory_main.addStep(steps.Trigger(
    schedulerNames=['trigger_from_main'],
    waitForFinish=True,
    haltOnFailure=True,
    name='trigger'
))

# main builder
c['builders'].append(
    util.BuilderConfig(name="test_main",
                       workernames=['example-worker', 'tmp_worker0', 'tmp_worker1'],
                       factory=factory_main,
                       )
)

# lock
worker_lock = [util.WorkerLock("worker_builds", maxCount=1).access('counting')]

# 1st of 3 sub-builders
c['builders'].append(
    util.BuilderConfig(name="triggered_build_0",
                       workernames=['example-worker', 'tmp_worker0', 'tmp_worker1'],
                       factory=factory_subbuild,
                       locks=worker_lock,
                       )
)

# 2nd of 3 sub-builders
...

# 3rd of 3 sub-builders
...
```
Answer
This behavior is caused by the locks and by the way the master distributes builds.
When a build needs to run, the master performs the following steps (link to the source code):
- Select a worker at random
- Check whether the worker and the build are compatible
- Check whether the build can take the lock(s) it needs on that worker
- Assign the build to the worker
The lock itself is only taken when the build starts on the worker.
So if the master checks the requirements of the next build to dispatch before the previous build has acquired its lock, it can dispatch the new build to the same worker, even though both builds need the same lock.
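To make the race concrete, here is a toy Python model of the steps above. Everything in it (`ToyWorker`, `dispatch`, `start_builds`, `MAX_COUNT`) is hypothetical and only mimics the ordering described in this answer, not Buildbot's actual scheduler code: the lock is checked when the build is assigned but only taken when the build starts. The "random" worker choice is pinned to `example-worker` to reproduce the unlucky case.

```python
import random

# Toy model of the race (hypothetical names, not Buildbot APIs): the lock is
# checked when a build is assigned, but only taken when the build starts.

MAX_COUNT = 1  # mirrors WorkerLock("worker_builds", maxCount=1)


class ToyWorker:
    def __init__(self, name):
        self.name = name
        self.lock_holders = 0  # builds currently holding this worker's lock
        self.assigned = []     # builds assigned but not started yet
        self.running = []      # builds that got the lock
        self.waiting = []      # builds stuck acquiring the lock

    def lock_available(self):
        return self.lock_holders < MAX_COUNT


def dispatch(builds, workers, pick=random.choice):
    """Assign pending builds; none has started, so no lock is held yet."""
    pending = []
    for build in builds:
        worker = pick(workers)             # select a worker (random by default)
        if worker.lock_available():        # the lock still looks free
            worker.assigned.append(build)  # assign the build to that worker
        else:
            pending.append(build)          # would be retried on a later pass
    return pending


def start_builds(workers):
    """Each assigned build now tries to take the lock on its worker."""
    for worker in workers:
        for build in worker.assigned:
            if worker.lock_available():
                worker.lock_holders += 1
                worker.running.append(build)
            else:
                worker.waiting.append(build)
        worker.assigned.clear()


workers = [ToyWorker(n) for n in ("example-worker", "tmp_worker0", "tmp_worker1")]
# Pin the "random" choice to example-worker to reproduce the unlucky case.
pending = dispatch(["triggered_build_0", "triggered_build_1"], workers,
                   pick=lambda ws: ws[0])
start_builds(workers)
print(workers[0].running)  # ['triggered_build_0']
print(workers[0].waiting)  # ['triggered_build_1'] while tmp_worker1 idles
```

Both lock checks run before either build has started, so both pass, and the second build ends up waiting on a busy worker while another worker stays idle.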
You can fix this by putting the worker that the build was assigned to into quarantine, which gives that build enough time to take the lock before the worker can be chosen again.
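A second toy model, again with hypothetical names rather than Buildbot APIs, shows why the quarantine helps: a hook that runs right before assignment (standing in for `canStartBuild`) quarantines the chosen worker, so the next build is routed to a different worker. For a deterministic result, this sketch picks the first available worker instead of a random one, and it assumes that a quarantined worker is not offered new builds.

```python
# Toy model of the quarantine workaround (hypothetical names, not Buildbot
# APIs). A quarantined worker is skipped by the dispatcher until its
# quarantine expires.

QUARANTINE = 5  # seconds, as in canStartBuildLockQuarantine
MAX_COUNT = 1   # mirrors WorkerLock("worker_builds", maxCount=1)


class ToyWorker:
    def __init__(self, name):
        self.name = name
        self.lock_holders = 0          # nobody has started yet, so stays 0
        self.quarantined_until = 0.0   # simulated time when quarantine ends
        self.assigned = []

    def available(self, now):
        return now >= self.quarantined_until and self.lock_holders < MAX_COUNT


def can_start_build(worker, now):
    """Stand-in for the canStartBuild hook: quarantine, then allow the build."""
    worker.quarantined_until = now + QUARANTINE
    return True


def dispatch(builds, workers, now):
    """Assign builds at simulated time `now`, skipping quarantined workers."""
    pending = []
    for build in builds:
        candidates = [w for w in workers if w.available(now)]
        if candidates and can_start_build(candidates[0], now):
            candidates[0].assigned.append(build)
        else:
            pending.append(build)
    return pending


workers = [ToyWorker(n) for n in ("example-worker", "tmp_worker0", "tmp_worker1")]
pending = dispatch(["triggered_build_0", "triggered_build_1"], workers, now=0.0)
for w in workers:
    print(w.name, w.assigned)
# example-worker ['triggered_build_0']
# tmp_worker0 ['triggered_build_1']
# tmp_worker1 []
```

The lock check alone would still pass for `example-worker` here, since no build has started; it is the quarantine that pushes `triggered_build_1` onto a free worker.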
You can do this with the `canStartBuild` function, which is called just before the build is assigned to the worker (docs):
```python
def canStartBuildLockQuarantine(builder, wfb, request):
    # Put the worker in quarantine for 5 seconds
    wfb.worker.quarantine_timeout = 5
    wfb.worker.putInQuarantine()
    # Reset wfb.worker.quarantine_timeout
    wfb.worker.resetQuarantine()
    return True
```
Then pass it to each builder that takes the lock:
```python
c['builders'].append(
    util.BuilderConfig(name="triggered_build_0",
                       workernames=['example-worker', 'tmp_worker0', 'tmp_worker1'],
                       factory=factory_subbuild,
                       canStartBuild=canStartBuildLockQuarantine,
                       locks=worker_lock,
                       )
)
```
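On the `resetQuarantine()` call in `canStartBuildLockQuarantine`: as far as I can tell from Buildbot's worker code, `putInQuarantine()` starts a timer of `quarantine_timeout` seconds and then doubles `quarantine_timeout` (up to a maximum), while `resetQuarantine()` restores the initial value. The sketch below is a toy model of that bookkeeping, not Buildbot's implementation; the class and method names are hypothetical, and the 10 s initial and 1 h maximum defaults are assumptions. It shows that with the three-step pattern every lock quarantine lasts 5 seconds and the timeout is left at its initial value, whereas calling `putInQuarantine()` on its own keeps doubling it.

```python
# Toy model (NOT Buildbot's code) of worker quarantine bookkeeping, under the
# assumptions stated above: start a timer, then double the timeout up to a
# maximum; reset restores the initial timeout.


class ToyWorker:
    def __init__(self, initial=10, maximum=3600):  # assumed defaults
        self.quarantine_initial_timeout = initial
        self.quarantine_max_timeout = maximum
        self.quarantine_timeout = initial
        self.in_quarantine = False
        self.timers = []  # durations of the quarantine timers that were started

    def put_in_quarantine(self):
        if self.in_quarantine:  # already quarantined: nothing to do
            return
        self.in_quarantine = True
        self.timers.append(self.quarantine_timeout)
        # next time, wait twice as long (up to the maximum)
        self.quarantine_timeout = min(self.quarantine_timeout * 2,
                                      self.quarantine_max_timeout)

    def exit_quarantine(self):
        self.in_quarantine = False

    def reset_quarantine(self):
        self.quarantine_timeout = self.quarantine_initial_timeout


def can_start_build_lock_quarantine(worker):
    """Same three steps as canStartBuildLockQuarantine, on the toy worker."""
    worker.quarantine_timeout = 5
    worker.put_in_quarantine()
    worker.reset_quarantine()
    return True


w = ToyWorker()
for _ in range(3):
    can_start_build_lock_quarantine(w)
    w.exit_quarantine()  # the 5 s timer expires before the next build
print(w.timers)              # [5, 5, 5]
print(w.quarantine_timeout)  # 10, back to the initial value

w2 = ToyWorker()
for _ in range(3):
    w2.put_in_quarantine()   # no explicit timeout and no reset
    w2.exit_quarantine()
print(w2.timers)             # [10, 20, 40]
```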