Jobs not getting acquired quickly enough

We have a Java Spring Boot application using embedded Camunda 7.12.0. We’ve been seeing poor performance under heavy load, and it appears to be at least partly due to a backlog: jobs are queued for acquisition faster than they are being acquired.

I wrote the following query against the act_ru_job table so I can see how many jobs are acquirable, and how many are actually being executed at a given time.

select sum(case when (lock_exp_time_ is null and lock_owner_ is null) then 1 else 0 end) as "Queued",
       count(lock_owner_) as "Active"
  from act_ru_job arj
 where retries_ > 0;

I never see more than 15 or so jobs being executed at a given time, but sometimes there are 50-60 queued jobs. The application runs on a single server, but that server isn’t showing any memory or CPU pressure right now, so I’m confident it can handle more active jobs.

The Java command line for the application has the following properties defined:

-Dcamunda.bpm.job-execution.core-pool-size=60
-Dcamunda.bpm.job-execution.max-pool-size=100
-Dcamunda.bpm.job-execution.queue-capacity=100

These were recently bumped up from lower values. Even after making those changes, I see no improvement according to my query: it still seems that only 15 jobs are executed at any given time.

Am I correct in my understanding of the results from act_ru_job? Where else should I be looking to increase the number of jobs being executed?

Hello @jpzawisza1 ,

how is the database connection pool configured? Each running job needs a connection because it runs in its own transaction, so that could be a bottleneck as well.
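
With the standard Spring Boot Hikari integration, the pool size is usually set with a property like the following (the value is just an example; adjust to your setup):

```properties
# assumed standard Spring Boot + HikariCP configuration
spring.datasource.hikari.maximum-pool-size=40
```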

Also, I would recommend starting with smaller values (for example: core pool 20, max pool 30, queue capacity 50) and only increasing them gradually.
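
One detail that is easy to miss: as far as I know, the job executor’s thread pool is a plain java.util.concurrent.ThreadPoolExecutor, and that class only grows beyond the core pool size once its queue is completely full. So with queue-capacity=100, a max-pool-size of 100 is almost never reached. A small self-contained sketch of that behavior (the pool sizes here are made up for illustration):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolDemo {
    public static void main(String[] args) throws Exception {
        // core=2, max=4, queue capacity=2 -- same shape as the job executor's pool
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 4, 60, TimeUnit.SECONDS, new ArrayBlockingQueue<>(2));
        CountDownLatch release = new CountDownLatch(1);
        Runnable blocker = () -> {
            try { release.await(); } catch (InterruptedException ignored) { }
        };

        // 4 tasks: 2 occupy the core threads, 2 sit in the queue -- no extra threads yet
        for (int i = 0; i < 4; i++) pool.execute(blocker);
        System.out.println("threads after 4 tasks: " + pool.getPoolSize());

        // the 5th task overflows the queue; only now is a thread beyond core size created
        pool.execute(blocker);
        System.out.println("threads after 5 tasks: " + pool.getPoolSize());

        release.countDown();
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

This is also why lowering the queue capacity can counterintuitively help: jobs overflow the queue sooner, so additional threads actually get created.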

I hope this helps

Jonathan

Thanks, @jonathan.lukas . We use Hikari for our database connection pool, and the maximum number of connections is set to 40. I’ll increase that number and see what happens.

I’ll also set the other values (core pool, max pool, queue size) to what you recommended, and retry.

I’ll report back when I have more information, which probably won’t be until early next week.


@jonathan.lukas, I’m sorry to report that the changes have made no difference, at least according to the database query I mentioned before. We now have the following values configured.

-Dcamunda.bpm.job-execution.core-pool-size=20
-Dcamunda.bpm.job-execution.max-pool-size=30
-Dcamunda.bpm.job-execution.queue-capacity=50
-Dspring.datasource.hikari.maximum-pool-size=80

But when I run my original database query, I still see no more than 15 or so jobs at a time being run, and at least a dozen queued up at any time.

Is there anything else we should try? It may be worth noting that some of the jobs in question are expected to take a couple of minutes to complete, as they wait for responses from downstream systems.

Hello @jpzawisza1 ,

sorry to hear that.

But your last paragraph gives a good hint.

Jobs that take that long are actually an anti-pattern. The problem is that your job executor’s threads are actively blocked by them, while your database holds an open connection (and transaction) for each of them.

My recommendation would be to refactor the long-running tasks to either be async (send the request, receive the response later) or use the external task pattern.

Given that you have increased the capacity of everything and the engine still only acquires about 15 jobs at once: have you verified that your configuration is actually taking effect? From the settings, it looks like you are using Spring Boot. Is there a reason you do not configure the job executor directly in the application properties (per profile)?
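
For example, the same values from this thread could live in application.yaml instead of -D flags (property names per the camunda-bpm-spring-boot-starter; please verify against your starter version):

```yaml
camunda:
  bpm:
    job-execution:
      core-pool-size: 20
      max-pool-size: 30
      queue-capacity: 50
```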

Jonathan

Hi @jonathan.lukas ,

Agreed that our Camunda workflow is suboptimal and could be improved.

Having said that, even if a long-running task blocks the executor, shouldn’t increasing the number of database connections still improve the number of tasks being picked up at any given time? If the number of connections were doubled, I’d expect the executor to pick up twice as much work, assuming enough application threads are available.

We’re not able to invoke the Spring Boot actuator/env endpoint in our production environment, but we’ve followed the exact same steps to modify the values in a staging environment, and used actuator/env in that environment to confirm those steps do as we expect.

We do have settings in a Spring Boot properties file, but we can override a number of them in a Dockerfile, allowing them to be easily modified at runtime.

Hello @jpzawisza1 ,

your assumption is correct.

How many process instances are running concurrently?

Jonathan

Hi @jonathan.lukas,

Currently, we have a single process instance.

We’re going to experiment with using spring.datasource.maximum-pool-size instead of spring.datasource.hikari.maximum-pool-size, based on some things we’ve found online, and see what happens.

Hello @jpzawisza1 ,

why is the load high with a single process instance? What does it do?

Jonathan

Hi @jonathan.lukas,

I’m sorry, I think I misunderstood your question. We have a single instance of the application that’s using embedded Camunda, but the number of Camunda process instances running at a given time never goes beyond 15-20.

Jim

Hello @jpzawisza1 ,

thank you for clarifying and sorry if my question was not clear.

But the fact that you never have more than 15-20 process instances would explain why you do not have more active jobs either.

As long as the exclusive flag is set on an async continuation (job definition), the job is executed exclusively per process instance, meaning no other job for the same process instance can be active at the same time.

Please try removing the flag from the async continuations that might be executed in parallel. Keep in mind that this could lead to race conditions around variable modification.
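
In the BPMN XML, that would look something like this (the task id and name are placeholders, not from your model):

```xml
<serviceTask id="callDownstream" name="Call downstream system"
             camunda:asyncBefore="true"
             camunda:exclusive="false" />
```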

Jonathan

Hi @jonathan.lukas,

Apologies for the delayed response.

To make sure I’m giving you accurate information, I want to clarify my own understanding, as I’m somewhat new to Camunda. When we talk about process instances, I’m using the definition from Process instance creation | Camunda Platform 8 Docs , but I’m not entirely sure of the distinction between a process instance and a job. It seems like a job is a unit of work responsible for advancing the execution of a process instance: is that correct?

Regarding the database connection pool, it turns out we configure Hikari in a non-standard way, which was causing the default parameters not to be respected. We’ve put in a fix to read the maximum pool size correctly, which we’ll test next week in our production environment.

For the benefit of anyone else reading this, these are some points I’ve found in my research.

  • Some online sources list the maximum pool size parameter as maximum-pool-size, but others use maximumPoolSize. Based on the Hikari source code, I think maximumPoolSize is correct, but I can’t confirm this due to our non-standard setup.
  • Whether the parameter should be spring.datasource.hikari.maximumPoolSize or spring.datasource.maximumPoolSize seems to depend on which version of Spring Boot 2 you’re running.

Jim

Hello @jpzawisza1 ,

the link you provided is from the Camunda 8 docs. To avoid confusion, here is a link to the Camunda 7 docs:

A process instance is an instance of your process definition (the diagram you provide), while a job is what the process instance executes asynchronously when you have transaction boundaries. You can read more about it here:

For standard Spring Boot (at least the versions I am aware of, 2.5 onwards), it does not matter whether you use camelCase (maximumPoolSize) or kebab-case (maximum-pool-size); relaxed binding matches both to maximumPoolSize.
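
So both of these forms should bind to the same Hikari setting (value taken from your earlier post):

```properties
spring.datasource.hikari.maximum-pool-size=80
# equivalent under Spring Boot relaxed binding:
# spring.datasource.hikari.maximumPoolSize=80
```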

I hope this helps

Jonathan

Hi @jonathan.lukas,

We increased the number of database pool connections in our production environment last week, but it made no difference: in the act_ru_job table, we still see no more than 15 or so jobs being executed at once, and many more in a queued state.

It’s possible that, due to our non-standard Hikari setup, the change isn’t doing what we expect. We plan to enable some of the debug logging described at The Job Executor: What Is Going on in My Process Engine? | Camunda and see what it tells us.

Aside from what’s mentioned in the blog post, is there any other logging we could turn on to confirm the Hikari settings that Camunda is using? I can look through the source code at GitHub - camunda/camunda-bpm-platform: Flexible framework for workflow and decision automation with BPMN and DMN. Integration with Spring, Spring Boot, CDI. , but I wanted to ask as well.


Hi @jonathan.lukas,

Apologies again for the delay in responding.

For the benefit of others reading this, I looked at some of the queries at camunda-7-code-examples/snippets/db-queries-for-monitoring at main · camunda-consulting/camunda-7-code-examples · GitHub , and I realized that my original query was incorrect. I should have used the following query to determine the number of queued and active jobs:

select sum(case when (retries_ > 0 and duedate_ is null and lock_owner_ is null
                      and (suspension_state_ = 1 or suspension_state_ is null))
             then 1 else 0 end) as "Queued",
       count(lock_owner_) as "Active"
  from act_ru_job arj;

Using that query, the number of active jobs is in line with the number of available database connections. We’re still having performance issues, but we’re going to look at our workflows in more detail and see how we can improve them as per your earlier suggestions.

I’ll mark your answer from earlier in the thread as the solution. Thanks for all your help.


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.