Scaling multi-instance tasks: true parallel execution

I’m trying to execute 20k instances of a multi-instance external task in parallel within a reasonable amount of time, using a non-exclusive multi-instance setup.
But I seem to be hitting some limitations: even though each external task only logs a line and sleeps 50 ms, executing the first 1000 tasks takes 5 min on a Core i7 quad-core (with >98% CPU load the whole time).

Does someone have an idea what I misconfigured, or whether Camunda just doesn’t work for running that many tasks in parallel?

(sample project available at GitHub - marsewe/camunda-external-task-test: Test for scaling up Camunda multi instances )

Execution finishes in ~30 seconds for 1000 tasks on my machine. I changed the thread pool executor as follows to avoid exceptions when submitting a task while the queue is full (otherwise a rejected task could only be fetched again after its 10 min lock time has expired):

this.noQueueExecutorService = new ThreadPoolExecutor(15, 15, 10000, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(100), new CallerRunsPolicy());
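To illustrate the effect of that change, here is a minimal, self-contained sketch (class and variable names are my own, not from the sample project): with the default AbortPolicy, the submit loop would throw RejectedExecutionException once the bounded queue fills up; CallerRunsPolicy instead runs the overflow task on the submitting thread, which throttles the loop and loses no work.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CallerRunsDemo {
    public static void main(String[] args) throws Exception {
        // Same shape as the pool above: 15 threads, bounded queue of 100,
        // CallerRunsPolicy instead of the default AbortPolicy.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                15, 15, 10_000, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(100),
                new ThreadPoolExecutor.CallerRunsPolicy());

        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < 1_000; i++) {
            // With AbortPolicy this loop would throw RejectedExecutionException
            // once the queue is full; CallerRunsPolicy instead executes the
            // task on this (submitting) thread, slowing the loop down.
            pool.execute(() -> {
                try {
                    Thread.sleep(5); // stand-in for the 50 ms external-task work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                completed.incrementAndGet();
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println(completed.get()); // prints 1000: no task was rejected
    }
}
```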

I also changed the camunda:exclusive value to true on the multi-instance asynchronous continuation. Note that the multi-instance asyncAfter job requires synchronization in the process engine (it must join the multi-instance executions and decide when the outgoing sequence flow can be taken), so these jobs must be executed serially anyway (which is why you see plenty of optimistic locking exceptions when you set the flag to false).


Thanks for your reply.
Maybe I did not describe my problem clearly enough.
It works fine with 1000 tasks and camunda:exclusive=true for the multi-instance.

But I cannot make this work for 20 000 tasks.
If running it for 20k tasks with these multi-instance-exclusive settings I noticed the following:

  1. My thread pool for the workers isn’t utilised at all even though there’s plenty of work to do; at most 2 threads are busy. The others are waiting for tasks in the queue:
    • parking to wait for <0x00000006c8941260> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
      at java.util.concurrent.locks.LockSupport.park(…)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(…)
      at java.util.concurrent.ArrayBlockingQueue.take(…)
      at java.util.concurrent.ThreadPoolExecutor.getTask(…)

See e.g. for some thread dumps.

  2. About twice per task, an expensive DB query is run: "SELECT * FROM ACT_RU_EXECUTION WHERE PROC_INST_ID_ = '2e2b8730-2eab-11e8-b9c4-7ed37dcdf91e';". It fetches all of the approx. 20k rows and can take up to 600 ms.
    (I wonder if Camunda deserialises each row into a Java object?)



What is your job executor configuration? This thread may provide a little insight…



From experimenting with it a bit, I can see that it takes quite a while until the external task instances are available in their table. Executions for them already exist, but for each of them an entry in the external task table needs to be created. Until then, the fetchAndLock query cannot find and return them, and thus the worker cannot take care of them either.

In my rough measurements it took about 25s for a 1,000 multi-instance external tasks and about 70 min for 20,000 of them. The more instances the longer it takes – the query mentioned by @mase might be one of the causes.

This means, for 20,000 it takes roughly 70 min before the last external task can even be found by fetchAndLock and therefore delivered to any external worker.

My measurements were taken without any fetchAndLock requests happening.

I also experimented a bit with the task executor’s core and max pool size, the job executor’s maxJobsPerAcquisition, and exclusive=false/true for the outer and the multi-instance activity.

As Webcyberrob pointed out, the problem is indeed job acquisition.
Camunda’s AcquireJobsCmd tries to
List<JobEntity> jobs = commandContext.getJobManager().findNextJobsToExecute(new Page(0, numJobsToAcquire));
but does not get any jobs. ACT_RU_JOB has approx. 19,500 entries, but the fired query

select RES.*
from ACT_RU_JOB RES
where (RES.RETRIES_ > 0)
and (RES.DUEDATE_ is null or RES.DUEDATE_ <= '27-Mar-18')
and (RES.LOCK_OWNER_ is null or RES.LOCK_EXP_TIME_ < '27-Mar-18')
and ( ( RES.EXCLUSIVE_ = true
and not exists(
select J2.* from ACT_RU_JOB J2
where J2.PROCESS_INSTANCE_ID_ = RES.PROCESS_INSTANCE_ID_
-- from the same proc. inst.
and (J2.EXCLUSIVE_ = true)
-- also exclusive
and (J2.LOCK_OWNER_ is not null and J2.LOCK_EXP_TIME_ >= '27-Mar-18')
-- in progress
) )
or RES.EXCLUSIVE_ = false )

does not return any of them.
(Thus the ACT_RU_EXT_TASK table remains empty as well.)
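To make the semantics of that WHERE clause concrete, here is a small model of it in plain Java (this is not Camunda’s code; the record and method names are mine): a job is acquirable only if it is non-exclusive, or if it is exclusive and no other exclusive job of the same process instance is currently locked. With one big process instance, a single in-flight exclusive job therefore blocks all of its ~19,500 siblings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ExclusiveAcquisitionModel {
    // Hypothetical minimal job; field names mirror the ACT_RU_JOB columns.
    record Job(String procInstId, boolean exclusive, boolean locked) {}

    static List<Job> acquirable(List<Job> jobs) {
        // Process instances that currently have a locked exclusive job
        // (the "not exists" sub-select in the query above).
        Set<String> blockedInstances = jobs.stream()
                .filter(j -> j.exclusive() && j.locked())
                .map(Job::procInstId)
                .collect(Collectors.toSet());
        // Non-exclusive jobs always qualify; exclusive ones only if their
        // process instance has no locked exclusive job.
        return jobs.stream()
                .filter(j -> !j.locked())
                .filter(j -> !j.exclusive() || !blockedInstances.contains(j.procInstId()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Job> jobs = new ArrayList<>();
        jobs.add(new Job("pi-1", true, true));      // one exclusive job in progress
        for (int i = 0; i < 5; i++) {
            jobs.add(new Job("pi-1", true, false)); // more exclusive jobs, same instance
        }
        jobs.add(new Job("pi-2", true, false));     // exclusive job of another instance

        // Only the pi-2 job can be acquired; all pi-1 jobs are blocked.
        System.out.println(acquirable(jobs).size()); // prints 1
    }
}
```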

Some thoughts on this:

About twice per task an expensive DB query is run: "SELECT * FROM ACT_RU_EXECUTION WHERE PROC_INST_ID_ = '2e2b8730-2eab-11e8-b9c4-7ed37dcdf91e';"

Yes, that can be expensive. You could consider distributing the single tasks among more process instances so that the individual selects return less data. The same goes for insertion: with 20,000 instances in one multi-instance activity, you have a single transaction that inserts all of those entries into ACT_RU_EXT_TASK and ACT_RU_JOB. I believe partitioning those could make sense.
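A sketch of that partitioning idea (the chunking helper is mine; the commented-out startProcessInstanceByKey call and the "items" variable name are only indicative of how chunks would be handed to the engine): instead of one 20,000-wide multi-instance, start one process instance per chunk.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionedStart {
    // Split items into chunks of at most chunkSize elements.
    static <T> List<List<T>> partition(List<T> items, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += chunkSize) {
            chunks.add(items.subList(i, Math.min(i + chunkSize, items.size())));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 20_000; i++) items.add(i);

        List<List<Integer>> chunks = partition(items, 500);
        // One process instance per chunk, e.g. (sketch, not runnable here):
        //   for (List<Integer> chunk : chunks) {
        //     runtimeService.startProcessInstanceByKey("myProcess",
        //         Variables.putValue("items", chunk));
        //   }
        System.out.println(chunks.size());          // prints 40
        System.out.println(chunks.get(39).size());  // prints 500
    }
}
```

Each instance then inserts only 500 rows per transaction, and the PROC_INST_ID_ select touches 500 rows instead of 20,000.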

It may also make sense to make those tests with the database you plan on using in production. H2 is not so well suited for performance tests in our experience.

Can you tell why the query returns no results?


It may also make sense to make those tests with the database you plan on using in production. H2 is not so well suited for performance tests in our experience.

I’ve tried it with Postgres as well; it made no difference here.

Can you tell why the query returns no results?

It’s the sub-select. Most of the time there is at least one job for the same process instance ID still in progress, so no new jobs are acquired.

So my current temporary conclusions are:

  1. Multi-instance per se becomes very slow when used with tens of thousands of instances. Re-model your processes (i.e. more independent process instances instead of one big one orchestrating everything) so you do not run into these situations.
  2. On the one hand you should use "camunda:exclusive=true" to allow the process engine to join the multi-instance executions trouble-free. On the other hand this blocks you from scaling out across multi-instance executions, as no new jobs are acquired as long as one is still running.