We have a Java Spring Boot application that’s using embedded Camunda, version 7.12.0. We’ve been seeing poor performance under heavy load, and it appears to be at least partially due to a backlog where jobs are queued for acquisition, but not enough jobs are being acquired at a time.
I wrote the following query to monitor the act_ru_job table, so I can see how many jobs are acquirable and how many are actually being executed at a given time.
select sum(case when (lock_exp_time_ is null and lock_owner_ is null) then 1 else 0 end) as "Queued",
count(lock_owner_) as "Active"
from act_ru_job arj
where retries_ > 0;
I never see more than 15 or so jobs being executed at a given time, but sometimes there are 50-60 queued jobs. We run this application on a single server, but it isn’t under any memory or CPU pressure right now, so I’m confident it can handle more active jobs.
The Java command line for the application has the following properties defined:
-Dcamunda.bpm.job-execution.core-pool-size=60
-Dcamunda.bpm.job-execution.max-pool-size=100
-Dcamunda.bpm.job-execution.queue-capacity=100
These were recently bumped up from lower values. Even after making those changes, I see no improvement according to my query: it still seems that only 15 jobs are executed at any given time.
Am I correct in my understanding of the results from act_ru_job? Where else should I look to increase the number of jobs being executed?
Hello @jpzawisza1 ,
how is the database connection pool configured? Each running job needs a connection, as it runs in a transaction, so this could be a bottleneck as well.
Also, I would recommend starting with smaller values (for example: core pool 20, max pool 30, queue size 50) and then increasing them again step by step.
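For example, with HikariCP (the Spring Boot default pool), the pool size is controlled by a property along these lines; the value below is only an illustration and should comfortably cover the job executor’s max pool size plus any other database work:

spring.datasource.hikari.maximum-pool-size=40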
I hope this helps
Jonathan
Thanks, @jonathan.lukas. We use Hikari for our database connection pool, and the maximum number of connections is set to 40. I’ll increase that number and see what happens.
I’ll also set the other values (core pool, max pool, queue size) to what you recommended, and retry.
I’ll report back when I have more information, which probably won’t be until early next week.
@jonathan.lukas, I’m sorry to report that the changes have made no difference, at least according to the database query I mentioned before. We now have the following values configured.
-Dcamunda.bpm.job-execution.core-pool-size=20
-Dcamunda.bpm.job-execution.max-pool-size=30
-Dcamunda.bpm.job-execution.queue-capacity=50
-Dspring.datasource.hikari.maximum-pool-size=80
But when I run my original database query, I still see no more than 15 or so jobs being run at a time, and at least a dozen queued up at any given moment.
Is there anything else we should try? It may be worth noting that some of the jobs in question are expected to take a couple of minutes to complete, as they’re waiting for responses from downstream systems.
Hello @jpzawisza1 ,
sorry to hear that.
But your last paragraph gives us a good hint.
Jobs that take that long are actually an anti-pattern. The problem is that they actively block job executor threads, while the database holds an open connection (and transaction) for each of them.
My recommendation would be to refactor the long-running tasks to either be asynchronous (send the request, receive the response later) or to use the external task pattern.
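As a rough sketch of the external task pattern (the URL, topic name, and timings below are only placeholders), a worker built with the camunda-external-task-client Java library could look like this:

import org.camunda.bpm.client.ExternalTaskClient;

public class DownstreamWorker {

    public static void main(String[] args) {
        // Placeholder URL: point this at your engine's REST API.
        ExternalTaskClient client = ExternalTaskClient.create()
                .baseUrl("http://localhost:8080/engine-rest")
                .asyncResponseTimeout(10000) // long polling to reduce idle requests
                .build();

        // Placeholder topic name: the external service task in the model
        // would reference the same topic.
        client.subscribe("downstream-call")
                .lockDuration(180000) // lock long enough for the slow downstream response
                .handler((externalTask, externalTaskService) -> {
                    // Call the downstream system here, outside the engine's transaction,
                    // then report completion back to the engine.
                    externalTaskService.complete(externalTask);
                })
                .open();
    }
}

This way the long wait happens in the worker, and the engine’s job executor threads and database connections are not blocked while the downstream system responds.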
Regarding the observation that the engine only acquires around 15 jobs at once even though you have increased every capacity setting: have you verified that the way you configure these values actually takes effect? From the way they are set, it looks like you are using Spring Boot. Is there a reason why you do not configure the job executor directly in the properties (for example in a profile-specific file)?
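For example (just a sketch, using the same property names you already pass on the command line), an application.properties or a profile-specific variant such as application-prod.properties could contain:

camunda.bpm.job-execution.core-pool-size=20
camunda.bpm.job-execution.max-pool-size=30
camunda.bpm.job-execution.queue-capacity=50

That would take the JVM arguments out of the picture and make it easier to verify which values the engine actually sees.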
Jonathan
Hi @jonathan.lukas ,
Agreed that our Camunda workflow is sub-optimal, and could be improved on.
Having said that, even if a long-running task is blocking the executor, shouldn’t an increase in the number of database connections lead to some improvement in the number of tasks that are being picked up at any given time? If the number of connections was doubled, I would think the executor could pick up twice as much work, assuming enough application threads are available.
We’re not able to invoke the Spring Boot actuator/env endpoint in our production environment, but we’ve followed the exact same steps to modify the values in a staging environment, and used actuator/env there to confirm that those steps do what we expect.
We do have settings in a Spring Boot properties file, but we can override a number of them in a Dockerfile, allowing them to be modified easily at runtime.
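For illustration only (a simplified sketch, not our exact Dockerfile), the idea is along these lines, using JAVA_TOOL_OPTIONS so the JVM picks the flags up as system properties:

ENV JAVA_TOOL_OPTIONS="-Dcamunda.bpm.job-execution.core-pool-size=20 \
  -Dcamunda.bpm.job-execution.max-pool-size=30 \
  -Dcamunda.bpm.job-execution.queue-capacity=50 \
  -Dspring.datasource.hikari.maximum-pool-size=80"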
Hello @jpzawisza1 ,
your assumption is correct.
How many process instances are running concurrently?
Jonathan
Hi @jonathan.lukas,
Currently, we have a single process instance.
We’re going to experiment with using spring.datasource.maximum-pool-size instead of spring.datasource.hikari.maximum-pool-size, based on some things we’ve found online, and see what happens.
Hello @jpzawisza1 ,
why is the load high with a single process instance? What does it do?
Jonathan
Hi @jonathan.lukas,
I’m sorry, I think I misunderstood your question. We have a single instance of the application that’s using embedded Camunda, but the number of Camunda process instances running at a given time never goes beyond 15-20.
Jim
Hello @jpzawisza1 ,
thank you for clarifying and sorry if my question was not clear.
But the fact that you never have more than 15-20 process instances would explain why you do not see more active jobs either.
As long as the exclusive flag is set on an async continuation (job definition), its jobs are executed exclusively per process instance, meaning no other job for the same process instance can be active at the same time.
Please try removing the flag from the asyncs that might be executed in parallel. Please keep in mind that this could lead to race conditions regarding variable modification.
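For example, on a service task the flag would look something like this in the BPMN XML (the id and delegate class are only examples):

<serviceTask id="callDownstream" name="Call downstream system"
             camunda:asyncBefore="true"
             camunda:exclusive="false"
             camunda:class="com.example.CallDownstreamDelegate" />

With exclusive set to false, jobs belonging to the same process instance may be picked up and executed in parallel, which is also why the variable race conditions mentioned above become possible.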
Jonathan