How to prevent the same job from being executed by more than one executor thread

Hello,

We deploy our process app into a homogeneous environment. We have ‘asynchronous continuation’ enabled for all of our system tasks, per Camunda’s best practices.

Recently, we found that the same job can be executed by more than one executor thread. Looking at the log below, thread-1 picked up the job first and completed it. But when it tried to update/remove the job in the job table, it hit an OptimisticLockingException. We suspect that because of this failure, thread-2 picked up the same job and completed it a second time.

Is this the expected behavior? How can we prevent thread-2 from picking up a job that has already been completed by thread-1?

{ "thread" : "camundaTaskExecutor-1",
  "applicationName" : "baseWorkflowApp",
  "message" : "Service task Publish QC Request to HS completed for business key 2995294" }

{ "timestamp" : "2020-03-25T16:34:33.318Z",
  "thread" : "camundaTaskExecutor-1",
  "logger" : "org.camunda.bpm.engine.jobexecutor",
  "message" : "ENGINE-14006 Exception while executing job 7d86a28d-6eb6-11ea-86dc-005056aeee74: OptimisticLockingException. To see the full stacktrace set logging level to DEBUG." }

{ "timestamp" : "2020-03-25T16:34:33.325Z",
  "thread" : "camundaTaskExecutor-2",
  "message" : "Service task Publish QC Request to HS started for business key 2995294" }

Thank you so much for the help!

Jason

Hi Jason.

The engine is designed to ensure that a job is only executed by one job executor thread at a time. The way this works is that the job executor claims the job and ‘locks’ it in the DB so other threads cannot take ownership.

However, this claim is a lease and is usually set to 5 minutes. This is so that if a job execution crashes, the job will eventually be picked up by another thread.

This can cause duplicate execution: if your job takes longer than the lease, another thread may claim it. When the first thread then tries to complete, an optimistic locking exception is thrown because the engine detects that another thread has taken ownership of the job.
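For reference, the lease is the job executor’s ‘lock time’. Here is a minimal sketch of how you could read the configured value from a running engine (assuming you have a ProcessEngine reference); the 5-minute default corresponds to 300000 ms:

```java
import org.camunda.bpm.engine.ProcessEngine;
import org.camunda.bpm.engine.impl.cfg.ProcessEngineConfigurationImpl;

public class LockTimeCheck {

  // Prints the job executor's lease ("lock time"); the default is 300000 ms = 5 minutes.
  public static void printLockTime(ProcessEngine engine) {
    ProcessEngineConfigurationImpl config =
        (ProcessEngineConfigurationImpl) engine.getProcessEngineConfiguration();
    System.out.println("Job executor lock time: "
        + config.getJobExecutor().getLockTimeInMillis() + " ms");
  }
}
```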

If your jobs are taking more than 5 minutes, see if you can change them to asynchronous patterns. If you are making external API calls and the API can block for longer than 5 minutes, set a more aggressive timeout, etc.

Step 1 - confirm this is the cause, i.e. your job execution time exceeds the lease lock time.
Step 2 - identify ways to reduce this time or change to more async integration patterns (see the sketch below).
Step 3 - set the job executor lock time to an appropriate value (do this as a last resort…)
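As an illustration of Step 2, here is a rough sketch of a delegate that calls an external API with timeouts kept well below the 5-minute lease, so a hanging call fails fast instead of letting the job outlive its lock. The class name, endpoint URL, and variable name are made up for the example:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

import org.camunda.bpm.engine.delegate.DelegateExecution;
import org.camunda.bpm.engine.delegate.JavaDelegate;

public class PublishQcRequestDelegate implements JavaDelegate {

  // Connect and request timeouts are far below the default 5-minute job lock time,
  // so a blocked API call fails fast rather than letting the lease expire mid-execution.
  private final HttpClient client = HttpClient.newBuilder()
      .connectTimeout(Duration.ofSeconds(5))
      .build();

  @Override
  public void execute(DelegateExecution execution) throws Exception {
    HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.org/qc-requests"))
        .timeout(Duration.ofSeconds(30))
        .POST(HttpRequest.BodyPublishers.ofString(
            "{\"businessKey\":\"" + execution.getProcessBusinessKey() + "\"}"))
        .build();

    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
    execution.setVariable("qcRequestStatus", response.statusCode());
  }
}
```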

regards

Rob


Hello Rob,
Thank you so much for your reply.
However, I’d like to clarify some points from our observations:

  1. In our case, the execution time is far less than the lease lock time. In the example attached above, it only takes 3 ms, so it is unlikely that the job execution time exceeded the lease lock time.
  2. We have already implemented asynchronous patterns for potentially long-running API calls.
  3. Another observation: once thread-1 hit the OptimisticLockingException, thread-2 jumped in and picked up the same job after only a 13 ms delay, well before the lease lock (I assume by default it’s 5 minutes) could have expired.

Our questions are the following:

  1. Without the lease lock expiring, how could thread-2 pick up the same job?
  2. In the case of thread-1, it seems it failed after the job execution had already completed. In that case, is it true that another thread will always attempt to re-execute the same job, resulting in duplicate job execution, in our case duplicate API calls?
    Note: the issue seems to be consistent and across the board.

Here is another instance to help illustrate the issue. Basically, for the same business key 2996185, within one second the task was started -> completed -> started again:

{"timestamp":"2020-03-26T11:56:54.480-04","level":"INFO",
 "message":"Service task Acquire ESS token started for business key 2996185","context":"default"}
{"timestamp":"2020-03-26T11:56:54.483-04",
 "message":"Service task Acquire ESS token completed for business key 2996185","context":"default"}
{"timestamp":"2020-03-26T11:56:54.507-04","level":"WARN",
 "message":"ENGINE-14006 Exception while executing job 6a11408c-6f7a-11ea-89b1-005056aeee74: OptimisticLockingException. To see the full stacktrace set logging level to DEBUG.","context":"default"}
{"timestamp":"2020-03-26T11:56:54.520-04","level":"INFO",
 "message":"Service task Acquire ESS token started for business key 2996185","context":"default"}

Thank you!

Jason

Hi Jason,

Are you running in a cluster? I’ve seen something similar when nodes in a cluster lose time sync.

Are you running a DB cluster or a DB without the correct isolation level? If transactions are using separate snapshots, then you could have threads tripping over each other…

regards

Rob

Hi Rob,

Yes, we are running in a cluster with two nodes. We had already ensured the time is synced on both nodes.

We are not running a DB cluster. Per Camunda’s best practice, we have already set the isolation level to READ_COMMITTED.

Can you suggest any other things we should look into? We are now trying to come up with a simple PoC process with which we can demo the issue; however, we are running into an intermittent class-loading issue.

Thanks
Jason

Hi Jason,

Can you post your process model or a fragment which reproduces the error? Some obscure code conditions can cause optimistic locking exceptions… things like spawning additional threads or sending a message to yourself…

regards

Rob

Hi Rob,
Sorry for the delay. We’ve done more analysis and found that the issue is, as you suspected, on our side. We had an accidental (i.e. unintentional) heterogeneous deployment in a supposedly homogeneous environment. After we removed the heterogeneous deployment, the issue went away.

Thank you for the help.

Jason

Do you understand why this happened? I don’t. If you do, please explain!

@fml2, what happened is that we have a two-node cluster deployed with the same code base; separately, we had another, slightly different codebase deployed outside of the cluster but pointing to the same database.

We don’t know exactly how the issue happened, but what we do know is that the heterogeneous deployment (the one outside the cluster) greatly increases the chances of optimistic locking conflicts.

I’m sorry, I still don’t get it. In my view, the job entries should be locked correctly regardless of whether you run it as a cluster or not. IMO, all that counts is the database the job executor is connected to.

So it’s good it worked for you, but I’d still be very interested in a real explanation, to avoid such problems :slight_smile: Any experts?

@fml2, if you are running your Camunda processes in a cluster, say two engines running in one cluster and both talking to the same database, you can end up seeing OptimisticLockingException(s).

To understand the root cause of OptimisticLockingExceptions, do read this article.

One way to reduce these exceptions is to use a backoff strategy; read the articles below for a better understanding.

https://docs.camunda.org/manual/7.15/user-guide/process-engine/the-job-executor/
read the section Backoff Strategy

https://docs.camunda.org/manual/7.15/user-guide/spring-boot-integration/configuration/
refer to the Camunda Engine Properties table

Play around with the properties below:
camunda.bpm.job-execution.backoff-time-in-millis
camunda.bpm.job-execution.max-backoff
camunda.bpm.job-execution.backoff-decrease-threshold

Tweaking these properties does not mean that you will never see OptimisticLockingExceptions, but you will see fewer of them.
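If you prefer to configure this in code rather than via the application properties, below is a rough sketch of a process engine plugin. It assumes the JobExecutor setters mirror the property names above, and the values are placeholders you would need to tune for your own load:

```java
import org.camunda.bpm.engine.impl.cfg.AbstractProcessEnginePlugin;
import org.camunda.bpm.engine.impl.cfg.ProcessEngineConfigurationImpl;
import org.camunda.bpm.engine.impl.jobexecutor.JobExecutor;

public class BackoffConfigurationPlugin extends AbstractProcessEnginePlugin {

  @Override
  public void postInit(ProcessEngineConfigurationImpl configuration) {
    JobExecutor jobExecutor = configuration.getJobExecutor();
    // Placeholder values; tune them for your own acquisition contention.
    jobExecutor.setBackoffTimeInMillis(100);      // backoff-time-in-millis
    jobExecutor.setMaxBackoff(5000);              // max-backoff
    jobExecutor.setBackoffDecreaseThreshold(10);  // backoff-decrease-threshold
  }
}
```

How the plugin gets registered depends on your setup; with the Spring Boot starter, exposing it as a bean is usually enough.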

@jchen5580 @Webcyberrob

Hello Rob,

Can we somehow disable this behavior, so that thread-2 does not pick up a job when its execution takes more time than the lease time in a cluster environment?

In our case, we have a homogeneous deployment and are facing the problem that jobs sometimes take more execution time than the lease time, which causes duplicate execution of the same job.

We have tried increasing the lease time as well, but after increasing it we did not understand why the job execution time increased too.

Regards,
Vickky

Hi Vickky,

Welcome to the forum.

Changing the lease duration should not affect individual job performance, so that’s a little curious…

An alternative could be the external task pattern. That way you have much finer-grained control over concurrency and lease management…
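For example, with the Java external task client you control the lock duration per fetch. A rough sketch (the REST endpoint, topic name, and lock duration are just placeholders):

```java
import org.camunda.bpm.client.ExternalTaskClient;

public class QcRequestWorker {

  public static void main(String[] args) {
    ExternalTaskClient client = ExternalTaskClient.create()
        .baseUrl("http://localhost:8080/engine-rest") // placeholder REST endpoint
        .asyncResponseTimeout(10_000)
        .build();

    client.subscribe("publish-qc-request")            // placeholder topic name
        .lockDuration(60_000)                         // lock duration you control per fetch
        .handler((externalTask, externalTaskService) -> {
          // call the external API here, then report completion
          externalTaskService.complete(externalTask);
        })
        .open();
  }
}
```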

regards

Rob

Dear Rob,

Thanks for the quick response

  1. Regarding the job performance: even though we are not able to find out why the lease time is affecting it, do we have any workaround by which we can prevent a single job from being acquired by multiple job executors?

  2. We cannot use the external task pattern (it works with a fetch-and-lock mechanism) for our use case; we are basically using a sequence of script tasks.

Regards,
Vickky

@Webcyberrob: Thanks! Can you please guide us on the above-mentioned concerns?

Regards,
Viccky