External task with id ... does not exist


We are running a simple Camunda setup with two engine nodes and one webapp node. The BPMN process sends a mail and some data to another service in parallel. The external task workers use the camunda-external-task-client Java dependency to fetch and process the tasks.

In most cases everything works fine. But sometimes, after all the task work in the execute method of the ExternalTaskHandler class has finished and the worker calls complete(externalTask), Camunda returns a 404 with the following error:

{"type":"RestException","message":"External task with id a831a090-4b89-11ec-993d-26559e49b303 does not exist"}

How is that possible if the task id was received from the API one second earlier and processed by the same task handler?

Because of the exception, the worker retries the task (with the same id) after a few seconds, and everything then runs fine.

I would really appreciate your help, as I am running out of ideas. If you need any further info such as code snippets, logs, etc., please let me know.


Hi @jschei,

Hard to say what the reason is.

Something that popped into my head: how many workers do you run? Did you give them different workerIds?

Hope this helps, Ingo
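(For reference, a distinct workerId can be set when the client is built. A minimal sketch assuming the camunda-external-task-client Java API; the base URL and topic name are made-up placeholders:)

```java
import org.camunda.bpm.client.ExternalTaskClient;

import java.util.UUID;

public class WorkerBootstrap {
    public static void main(String[] args) {
        // Give every worker instance its own id, e.g. a random suffix.
        // In Kubernetes, the pod name would also work well here.
        String workerId = "mail-worker-" + UUID.randomUUID();

        ExternalTaskClient client = ExternalTaskClient.create()
                .baseUrl("http://localhost:8080/engine-rest") // assumption: engine REST endpoint
                .workerId(workerId)
                .build();

        client.subscribe("send-mail") // hypothetical topic name
                .handler((task, service) -> {
                    // ... do the work ...
                    service.complete(task);
                })
                .open();
    }
}
```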

Hi Ingo,

thanks for your quick reply. I am running 1-10 workers (in a Kubernetes cluster); the number varies depending on load.
And yes, yesterday I found out that they were all using the same workerId, so I fixed that. But now I am also getting many of these:

TASK/CLIENT-01007 Exception while notifying a failure: The task's most recent lock could not be acquired

As mentioned, I am using a parallel gateway with two parallel tasks. Could the asyncBefore flag on the tasks in the BPMN process be an issue?


Unfortunately, removing the asyncBefore was not the solution either :frowning:

The corresponding error from the API is:

Failure of External Task 2f848000-4c50-11ec-8265-c2407b9df59e cannot be reported by worker 'app-ccb9d4fcc-r844be3d4cfc7-92bb-40c6-a2a5-f0db32614abc'. It is locked by worker 'app-ccb9d4fcc-ftdd698336fb0-b90e-4150-adac-af86cac89d78'

It seems that workers are processing tasks that are already locked by another worker?

Hi @jschei,

No, external service tasks are wait states and are written to the database once they are reached.

After having a look at the implementation of FetchExternalTasksCmd in the engine (camunda-bpm-platform/FetchExternalTasksCmd.java at master · camunda/camunda-bpm-platform · GitHub), it seems to me that you have too many workers competing for tasks. You can ignore the message, because in the meantime another worker has locked the task and is working on it.

Hope this helps, Ingo

Hi Ingo,

thanks for looking into this.

Yes, this is also my feeling, because it only happens under higher load.

I am a little confused about this, though. Of course I could just ignore the message, but when the above error occurs on complete(), the task work (e.g. sending an email) has already been done. Meanwhile, the worker that originally locked the task also works on it, meaning that for an email task the email would be sent twice. And from the logs I can see that this in fact happens.

My understanding was that with the lock mechanism Camunda prevents exactly this. If not, what can I do to avoid tasks being processed twice? Can I check the lock status and stop the execution of the task beforehand?
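(Regarding the "check the lock status beforehand" idea: the Java client's ExternalTask exposes getLockExpirationTime(), so a handler could refuse to start a non-idempotent side effect once its lock has expired or is about to expire. A minimal, framework-free sketch of such a guard; the safety margin is an assumption:)

```java
import java.util.Date;

public class LockGuard {

    /**
     * Returns true if the lock is still held with at least safetyMarginMs
     * to spare, i.e. it is still reasonably safe to perform a
     * non-idempotent side effect such as sending an email.
     */
    public static boolean lockStillHeld(Date lockExpiration, long safetyMarginMs) {
        if (lockExpiration == null) {
            return false; // no lock information: play it safe
        }
        return lockExpiration.getTime() - System.currentTimeMillis() > safetyMarginMs;
    }
}
```

In the handler this would be called with externalTask.getLockExpirationTime() right before sending the email. Note that such a check only shrinks the race window; it does not eliminate it.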

Hi @jschei,

Could you please share a complete stack trace where the message is shown?

Or provide more context when and where the messages appear?

Thank you, Ingo

Hi @Ingo_Richtsmeier ,

as the log is full of parallel request logs, I extracted the most important lines for you and anonymised some data due to company-internal details.

As you can see in the first 4 lines, two different workers do a fetchAndLock and receive the same task. Worker A then does some work, completes the task, and receives the error that it is locked (line 16). It then runs into a catch block to report the failure, which also fails due to the lock (line 24). As there is no try-catch inside that catch block, the NotAcquiredException is thrown up to the process engine. Meanwhile, worker B processes and completes the locked task successfully.

log.txt (8.6 KB)

I have also added anonymised, pseudocode-like source of the ExternalTaskHandler implementation.

ExternalTaskHandler.java.txt (1.5 KB)

Don’t hesitate to request more info, if needed.

Thanks !

I think I may have found the issue (by reading other forum articles over and over again :slight_smile:). The lock time was set to 7 s. So, for the attached log: the first worker locked the task and worked on it, but took more than 7 s to complete. Another worker came in, locked the task, and completed it successfully. The first worker then tried to complete, but had lost the lock and therefore failed. Is that plausible?
Not yet confirmed, of course; I am still testing right now.

Hi @jschei,

if the lock time is too short, the engine assumes that the worker died and another worker may fetch the task.

A short lock time seems to be a good explanation for the effects you saw.

You can also extend the lock time while your worker handles the task: External Task Client | docs.camunda.org

Hope this helps, Ingo
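(For reference, the extension Ingo links to is ExternalTaskService#extendLock. A minimal sketch of a handler that tops up its lock before a potentially slow step, assuming the camunda-external-task-client Java API; the 60 s value and the sendMail step are made-up examples:)

```java
import org.camunda.bpm.client.task.ExternalTask;
import org.camunda.bpm.client.task.ExternalTaskHandler;
import org.camunda.bpm.client.task.ExternalTaskService;

public class MailTaskHandler implements ExternalTaskHandler {

    @Override
    public void execute(ExternalTask externalTask, ExternalTaskService externalTaskService) {
        // Before a step that may exceed the subscription's lockDuration,
        // renew the lock for another 60 seconds (an assumed value).
        externalTaskService.extendLock(externalTask, 60_000L);

        sendMail(externalTask); // hypothetical slow, non-idempotent step

        externalTaskService.complete(externalTask);
    }

    private void sendMail(ExternalTask task) {
        // ...
    }
}
```

For long-running work, extendLock can also be called periodically from the handler so the lock never lapses while the worker is still alive.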

Hi @Ingo_Richtsmeier,

It seems to work now! The errors mentioned above are gone completely. I can now see other errors from the database that may be related to the long-running task work, but that is an application-specific issue, I guess.

Thanks for your help !

If one cannot say how much time a task may take (sometimes 2 s, sometimes 2 min under high load, etc.), how can I be sure that a task is never processed twice? Should I set the lock time to a very high value, or is that not recommended?