Weird JOB executor behavior

Hi,

I’ve a problem where JOB executor is running long running service tasks too frequently.
My understanding is if we have service task which exceeds 5 min run time - then Camunda will re-run process for 3 times unless it’s optimistic lock exception.
I will provide step by step log from server and db content and ask some questions - I think I don’t understand some details.

My env:
win10/h2/camunda 7.10.0/spring boot 5.1.2

Here is doc which I tried to understand but example bellow doesnt seems to be in line:

Especially this piece:
…Upon failure of job execution, e.g., if a service task invocation throws an exception, a job will be retried a number of times (by default 3). It is not immediately retried and added back to the acquisition queue…

Code used in this sample: https://drive.google.com/file/d/1GM0VhDNU7ue6DOJVI-e2SOyJhE8D9Pkb/view?usp=drivesdk

step 1)

LOG

2020-09-21 07:55:32.924 INFO 17320 — [aTaskExecutor-3] slow.task.LongRunningServiceTask : 1600689332924: START - Sleep for 6 min.
2020-09-21 08:01:32.926 INFO 17320 — [aTaskExecutor-3] slow.task.LongRunningServiceTask : 1600689332924: END - Sleep is done.

DB

TIMESTAMP_UTC_24H 2020-09-21 08:01:32
JOB_STATE_: 1
JOB_RETRIES_: 3
JOB_EXCEPTION_MSG_: ENGINE-03005 Execution of ‘DELETE TimerEntity[5a4e50a1-fc01-11ea-ac5d-0205857feb80]’ failed. Entity was updated by another transaction concurrently.

In this step I started job manually from Camunda Cockpit - finding failingProcess and starting it, to simulated exact situation I want ot test it has timer - alhough it’s set in the past so it tirggers immidietally.
I’ve no quesiton to this step - note that DB is comming from query on act_hi_job_log table.
Also I see exception in DB and examining log shows OptimisticLockException which is fine. The retries column is not increased - and that’s in line with docs saying that if we have OptimisticLockException then retries is not decreased.

step 2)

LOG

2020-09-21 08:00:48.424 INFO 17320 — [aTaskExecutor-1] slow.task.LongRunningServiceTask : 1600689648424: START - Sleep for 6 min.
2020-09-21 08:06:48.425 INFO 17320 — [aTaskExecutor-1] slow.task.LongRunningServiceTask : 1600689648424: END - Sleep is done.

DB

TIMESTAMP_UTC_24H: 2020-09-21 08:06:48
JOB_STATE_: 1
JOB_RETRIES_: 3
JOB_EXCEPTION_MSG_: ENGINE-03005 Execution of ‘DELETE TimerEntity[5a4e50a1-fc01-11ea-ac5d-0205857feb80]’ failed. Entity was updated by another transaction concurrently.

Here Camunda decides to restart job after roughly 5 minutes, it’s ok as 6 min service task is above 5 min.
Note that again we have OptimisticLockException which is not decresing retries column - that’s ok, no questions here.

step 3)

LOG

2020-09-21 08:01:33.065 INFO 17320 — [aTaskExecutor-2] slow.task.LongRunningServiceTask : 1600689693065: START - Sleep for 6 min.
2020-09-21 08:07:33.067 INFO 17320 — [aTaskExecutor-2] slow.task.LongRunningServiceTask : 1600689693065: END - Sleep is done.

DB

TIMESTAMP_UTC_24H: 2020-09-21 08:07:33
JOB_STATE_: 1
JOB_RETRIES_: 3
JOB_EXCEPTION_MSG_: ENGINE-03083 Exception while executing Batch Database Operations with message ’ ### Error flushing statements. Cause: org.apache.ibatis.executor.BatchExecutorException: org.camunda.bpm.engine.impl.persistence.entity.VariableInstanceEntity.deleteByteArrayNoRevisionCheck (batch index #5) failed. 4 prior sub executor(s) completed successfully, but will be rolled back. Cause: org.h2.jdbc.JdbcBatchUpdateException: Referential integrity constraint violation: “ACT_FK_JOB_EXCEPTION: PUBLIC.ACT_RU_JOB FOREIGN KEY(EXCEPTION_STACK_ID_) REFERENCES PUBLIC.ACT_GE_BYTEARRAY(ID_) (‘30fa85ef-fc02-11ea-ac5d-0205857feb80’)”; SQL statement:
delete from ACT_GE_BYTEARRAY where

Here again Camunda restarts our 6 min service task, but it’s bit odd:
Question-3.1:
If only locked jobs can be executed - how come we see that Camunda executes service even though it’s already executing it in step (2) above? I could understand if that would happen around 08:05+ (five minutes after starting service in step 2) - but here we see it’s less then 1 minute after starting job in step 2? Interestingly if you look closely this step seems to be linked with step 1 - as this execution happens at 08:01:33 which is just 1 second after step 1 finished at 08:01:32.
Question-3.2:
Why error is different then in step 1 and 2? looks like it’s refering to removal of the exception stack from db?

step 4)

LOG

2020-09-21 08:06:48.520 INFO 17320 — [aTaskExecutor-3] slow.task.LongRunningServiceTask : 1600690008520: START - Sleep for 6 min.
2020-09-21 08:12:48.520 INFO 17320 — [aTaskExecutor-3] slow.task.LongRunningServiceTask : 1600690008520: END - Sleep is done.

DB

TIMESTAMP_UTC_24H: 2020-09-21 08:12:48
JOB_STATE_:1
JOB_RETRIES_: 2
JOB_EXCEPTION_MSG_: ENGINE-03083 Exception while executing Batch Database Operations with message ’ ### Error flushing statements. Cause: org.apache.ibatis.executor.BatchExecutorException: org.camunda.bpm.engine.impl.persistence.entity.VariableInstanceEntity.deleteByteArrayNoRevisionCheck (batch index #5) failed. 4 prior sub executor(s) completed successfully, but will be rolled back. Cause: org.h2.jdbc.JdbcBatchUpdateException: Referential integrity constraint violation: “ACT_FK_JOB_EXCEPTION: PUBLIC.ACT_RU_JOB FOREIGN KEY(EXCEPTION_STACK_ID_) REFERENCES PUBLIC.ACT_GE_BYTEARRAY(ID_) (‘30fa85ef-fc02-11ea-ac5d-0205857feb80’)”; SQL statement:
delete from ACT_GE_BYTEARRAY where

Question-4.1:
This execution looks like continueation just after step 2 finishes at 08:06:48 - why is that in line with the time when step 2 finishes?
Question-4.2:
Why here the exception is not like in step 1 and 2 (OptimisticLockException) ?

step 5)

LOG

2020-09-21 08:07:33.191 INFO 17320 — [aTaskExecutor-1] slow.task.LongRunningServiceTask : 1600690053191: START - Sleep for 6 min.
2020-09-21 08:13:33.191 INFO 17320 — [aTaskExecutor-1] slow.task.LongRunningServiceTask : 1600690053191: END - Sleep is done.

DB

TIMESTAMP_UTC_24H: 2020-09-21 08:13:33
JOB_STATE_: 1
JOB_RETRIES_: 1
ENGINE-03083 Exception while executing Batch Database Operations with message ’ ### Error flushing statements. Cause: org.apache.ibatis.executor.BatchExecutorException: org.camunda.bpm.engine.impl.persistence.entity.VariableInstanceEntity.deleteByteArrayNoRevisionCheck (batch index #5) failed. 4 prior sub executor(s) completed successfully, but will be rolled back. Cause: org.h2.jdbc.JdbcBatchUpdateException: Referential integrity constraint violation: “ACT_FK_JOB_EXCEPTION: PUBLIC.ACT_RU_JOB FOREIGN KEY(EXCEPTION_STACK_ID_) REFERENCES PUBLIC.ACT_GE_BYTEARRAY(ID_) (‘30fa85ef-fc02-11ea-ac5d-0205857feb80’)”; SQL statement:
delete from ACT_GE_BYTEARRAY where

Question-5.1
Why this execution doesn’t wait into 5th minute of step 4 execution but instead executes just after step 3 finishes?
Question-5.2:
Why here the exception is not like in step 1 and 2 (OptimisticLockException) ?

step 6)

LOG

2020-09-21 08:12:48.616 INFO 17320 — [aTaskExecutor-2] slow.task.LongRunningServiceTask : 1600690368616: START - Sleep for 6 min.
2020-09-21 08:18:48.617 INFO 17320 — [aTaskExecutor-2] slow.task.LongRunningServiceTask : 1600690368616: END - Sleep is done.
DB
TIMESTAMP_UTC_24H: 2020-09-21 08:18:48
JOB_STATE_: 1
JOB_RETRIES_: 0
JOB_EXCEPTION_MSG_: ENGINE-03083 Exception while executing Batch Database Operations with message ’ ### Error flushing statements. Cause: org.apache.ibatis.executor.BatchExecutorException: org.camunda.bpm.engine.impl.persistence.entity.VariableInstanceEntity.deleteByteArrayNoRevisionCheck (batch index #5) failed. 4 prior sub executor(s) completed successfully, but will be rolled back. Cause: org.h2.jdbc.JdbcBatchUpdateException: Referential integrity constraint violation: “ACT_FK_JOB_EXCEPTION: PUBLIC.ACT_RU_JOB FOREIGN KEY(EXCEPTION_STACK_ID_) REFERENCES PUBLIC.ACT_GE_BYTEARRAY(ID_) (‘30fa85ef-fc02-11ea-ac5d-0205857feb80’)”; SQL statement:
delete from ACT_GE_BYTEARRAY where

Question-6.1
Same 2 questions like in 5.1 and 5.2

Maybe your job lock expiration time is lower than 6min?
so your first service task execution keeps running but another job executor will start to run it too after the lock expiration time.

Please have a look at these properties like lockTimeInMillis wich default is 5 minutes:

https://docs.camunda.org/manual/7.15/reference/deployment-descriptors/tags/job-executor/#job-acquisition-configuration-properties