Hi,
I’ve a problem where JOB executor is running long running service tasks too frequently.
My understanding is if we have service task which exceeds 5 min run time - then Camunda will re-run process for 3 times unless it’s optimistic lock exception.
I will provide step by step log from server and db content and ask some questions - I think I don’t understand some details.
My env:
win10/h2/camunda 7.10.0/spring boot 5.1.2
Here is doc which I tried to understand but example bellow doesnt seems to be in line:
Especially this piece:
…Upon failure of job execution, e.g., if a service task invocation throws an exception, a job will be retried a number of times (by default 3). It is not immediately retried and added back to the acquisition queue…
Code used in this sample: https://drive.google.com/file/d/1GM0VhDNU7ue6DOJVI-e2SOyJhE8D9Pkb/view?usp=drivesdk
step 1)
LOG
2020-09-21 07:55:32.924 INFO 17320 — [aTaskExecutor-3] slow.task.LongRunningServiceTask : 1600689332924: START - Sleep for 6 min.
2020-09-21 08:01:32.926 INFO 17320 — [aTaskExecutor-3] slow.task.LongRunningServiceTask : 1600689332924: END - Sleep is done.
DB
TIMESTAMP_UTC_24H 2020-09-21 08:01:32
JOB_STATE_: 1
JOB_RETRIES_: 3
JOB_EXCEPTION_MSG_: ENGINE-03005 Execution of ‘DELETE TimerEntity[5a4e50a1-fc01-11ea-ac5d-0205857feb80]’ failed. Entity was updated by another transaction concurrently.
In this step I started job manually from Camunda Cockpit - finding failingProcess and starting it, to simulated exact situation I want ot test it has timer - alhough it’s set in the past so it tirggers immidietally.
I’ve no quesiton to this step - note that DB is comming from query on act_hi_job_log table.
Also I see exception in DB and examining log shows OptimisticLockException which is fine. The retries column is not increased - and that’s in line with docs saying that if we have OptimisticLockException then retries is not decreased.
step 2)
LOG
2020-09-21 08:00:48.424 INFO 17320 — [aTaskExecutor-1] slow.task.LongRunningServiceTask : 1600689648424: START - Sleep for 6 min.
2020-09-21 08:06:48.425 INFO 17320 — [aTaskExecutor-1] slow.task.LongRunningServiceTask : 1600689648424: END - Sleep is done.
DB
TIMESTAMP_UTC_24H: 2020-09-21 08:06:48
JOB_STATE_: 1
JOB_RETRIES_: 3
JOB_EXCEPTION_MSG_: ENGINE-03005 Execution of ‘DELETE TimerEntity[5a4e50a1-fc01-11ea-ac5d-0205857feb80]’ failed. Entity was updated by another transaction concurrently.
Here Camunda decides to restart job after roughly 5 minutes, it’s ok as 6 min service task is above 5 min.
Note that again we have OptimisticLockException which is not decresing retries column - that’s ok, no questions here.
step 3)
LOG
2020-09-21 08:01:33.065 INFO 17320 — [aTaskExecutor-2] slow.task.LongRunningServiceTask : 1600689693065: START - Sleep for 6 min.
2020-09-21 08:07:33.067 INFO 17320 — [aTaskExecutor-2] slow.task.LongRunningServiceTask : 1600689693065: END - Sleep is done.
DB
TIMESTAMP_UTC_24H: 2020-09-21 08:07:33
JOB_STATE_: 1
JOB_RETRIES_: 3
JOB_EXCEPTION_MSG_: ENGINE-03083 Exception while executing Batch Database Operations with message ’ ### Error flushing statements. Cause: org.apache.ibatis.executor.BatchExecutorException: org.camunda.bpm.engine.impl.persistence.entity.VariableInstanceEntity.deleteByteArrayNoRevisionCheck (batch index #5) failed. 4 prior sub executor(s) completed successfully, but will be rolled back. Cause: org.h2.jdbc.JdbcBatchUpdateException: Referential integrity constraint violation: “ACT_FK_JOB_EXCEPTION: PUBLIC.ACT_RU_JOB FOREIGN KEY(EXCEPTION_STACK_ID_) REFERENCES PUBLIC.ACT_GE_BYTEARRAY(ID_) (‘30fa85ef-fc02-11ea-ac5d-0205857feb80’)”; SQL statement:
delete from ACT_GE_BYTEARRAY where
Here again Camunda restarts our 6 min service task, but it’s bit odd:
Question-3.1:
If only locked jobs can be executed - how come we see that Camunda executes service even though it’s already executing it in step (2) above? I could understand if that would happen around 08:05+ (five minutes after starting service in step 2) - but here we see it’s less then 1 minute after starting job in step 2? Interestingly if you look closely this step seems to be linked with step 1 - as this execution happens at 08:01:33 which is just 1 second after step 1 finished at 08:01:32.
Question-3.2:
Why error is different then in step 1 and 2? looks like it’s refering to removal of the exception stack from db?
step 4)
LOG
2020-09-21 08:06:48.520 INFO 17320 — [aTaskExecutor-3] slow.task.LongRunningServiceTask : 1600690008520: START - Sleep for 6 min.
2020-09-21 08:12:48.520 INFO 17320 — [aTaskExecutor-3] slow.task.LongRunningServiceTask : 1600690008520: END - Sleep is done.
DB
TIMESTAMP_UTC_24H: 2020-09-21 08:12:48
JOB_STATE_:1
JOB_RETRIES_: 2
JOB_EXCEPTION_MSG_: ENGINE-03083 Exception while executing Batch Database Operations with message ’ ### Error flushing statements. Cause: org.apache.ibatis.executor.BatchExecutorException: org.camunda.bpm.engine.impl.persistence.entity.VariableInstanceEntity.deleteByteArrayNoRevisionCheck (batch index #5) failed. 4 prior sub executor(s) completed successfully, but will be rolled back. Cause: org.h2.jdbc.JdbcBatchUpdateException: Referential integrity constraint violation: “ACT_FK_JOB_EXCEPTION: PUBLIC.ACT_RU_JOB FOREIGN KEY(EXCEPTION_STACK_ID_) REFERENCES PUBLIC.ACT_GE_BYTEARRAY(ID_) (‘30fa85ef-fc02-11ea-ac5d-0205857feb80’)”; SQL statement:
delete from ACT_GE_BYTEARRAY where
Question-4.1:
This execution looks like continueation just after step 2 finishes at 08:06:48 - why is that in line with the time when step 2 finishes?
Question-4.2:
Why here the exception is not like in step 1 and 2 (OptimisticLockException) ?
step 5)
LOG
2020-09-21 08:07:33.191 INFO 17320 — [aTaskExecutor-1] slow.task.LongRunningServiceTask : 1600690053191: START - Sleep for 6 min.
2020-09-21 08:13:33.191 INFO 17320 — [aTaskExecutor-1] slow.task.LongRunningServiceTask : 1600690053191: END - Sleep is done.
DB
TIMESTAMP_UTC_24H: 2020-09-21 08:13:33
JOB_STATE_: 1
JOB_RETRIES_: 1
ENGINE-03083 Exception while executing Batch Database Operations with message ’ ### Error flushing statements. Cause: org.apache.ibatis.executor.BatchExecutorException: org.camunda.bpm.engine.impl.persistence.entity.VariableInstanceEntity.deleteByteArrayNoRevisionCheck (batch index #5) failed. 4 prior sub executor(s) completed successfully, but will be rolled back. Cause: org.h2.jdbc.JdbcBatchUpdateException: Referential integrity constraint violation: “ACT_FK_JOB_EXCEPTION: PUBLIC.ACT_RU_JOB FOREIGN KEY(EXCEPTION_STACK_ID_) REFERENCES PUBLIC.ACT_GE_BYTEARRAY(ID_) (‘30fa85ef-fc02-11ea-ac5d-0205857feb80’)”; SQL statement:
delete from ACT_GE_BYTEARRAY where
Question-5.1
Why this execution doesn’t wait into 5th minute of step 4 execution but instead executes just after step 3 finishes?
Question-5.2:
Why here the exception is not like in step 1 and 2 (OptimisticLockException) ?
step 6)
LOG
2020-09-21 08:12:48.616 INFO 17320 — [aTaskExecutor-2] slow.task.LongRunningServiceTask : 1600690368616: START - Sleep for 6 min.
2020-09-21 08:18:48.617 INFO 17320 — [aTaskExecutor-2] slow.task.LongRunningServiceTask : 1600690368616: END - Sleep is done.
DB
TIMESTAMP_UTC_24H: 2020-09-21 08:18:48
JOB_STATE_: 1
JOB_RETRIES_: 0
JOB_EXCEPTION_MSG_: ENGINE-03083 Exception while executing Batch Database Operations with message ’ ### Error flushing statements. Cause: org.apache.ibatis.executor.BatchExecutorException: org.camunda.bpm.engine.impl.persistence.entity.VariableInstanceEntity.deleteByteArrayNoRevisionCheck (batch index #5) failed. 4 prior sub executor(s) completed successfully, but will be rolled back. Cause: org.h2.jdbc.JdbcBatchUpdateException: Referential integrity constraint violation: “ACT_FK_JOB_EXCEPTION: PUBLIC.ACT_RU_JOB FOREIGN KEY(EXCEPTION_STACK_ID_) REFERENCES PUBLIC.ACT_GE_BYTEARRAY(ID_) (‘30fa85ef-fc02-11ea-ac5d-0205857feb80’)”; SQL statement:
delete from ACT_GE_BYTEARRAY where
Question-6.1
Same 2 questions like in 5.1 and 5.2