External task with id ... does not exist


We are running a simple Camunda setup with two engine nodes and one webapp node. The BPMN process sends a mail and some data to another service in parallel. The external task workers use the camunda-external-task-client Java dependency to fetch and process the tasks.

In most cases everything works fine. But sometimes, after all the task work in the execute method of the ExternalTaskHandler class has finished and the worker calls complete(externalTask), Camunda returns a 404 with the following error:

{"type":"RestException","message":"External task with id a831a090-4b89-11ec-993d-26559e49b303 does not exist"}

How is that possible if the task id was received from the API one second earlier and processed by the same task handler?

Because of the exception, the worker retries the task (with the same id) after a few seconds, and everything then runs fine.

I would really appreciate your help, as I am running out of ideas. If you need any further info such as code snippets, logs, etc., please let me know.


Hi @jschei,

Hard to say what the reason is.

Something that popped into my head: how many workers do you run? Did you give them different workerIds?

Hope this helps, Ingo
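(For reference, a distinct workerId can be set when the client is built. A minimal sketch assuming the camunda-external-task-client Java API; the base URL and topic name are made-up placeholders:)

```java
import org.camunda.bpm.client.ExternalTaskClient;

import java.util.UUID;

public class WorkerBootstrap {
    public static void main(String[] args) {
        // Give every worker instance its own id, e.g. a random suffix.
        // In Kubernetes, the pod name would also work well here.
        String workerId = "mail-worker-" + UUID.randomUUID();

        ExternalTaskClient client = ExternalTaskClient.create()
                .baseUrl("http://localhost:8080/engine-rest") // assumption: engine REST endpoint
                .workerId(workerId)
                .build();

        client.subscribe("send-mail") // hypothetical topic name
                .handler((task, service) -> {
                    // ... do the work ...
                    service.complete(task);
                })
                .open();
    }
}
```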

Hi Ingo,

thanks for your quick reply. I am running 1-10 workers (in a Kubernetes cluster); the number varies depending on load.
And yes, yesterday I found out that they were all using the same workerId, so I fixed that. But now I am also getting many of these:

TASK/CLIENT-01007 Exception while notifying a failure: The task's most recent lock could not be acquired

As mentioned, I am using a parallel gateway with two parallel tasks. Could the asyncBefore flag on the tasks in the BPMN process be an issue?


Unfortunately, removing the asyncBefore was not the solution either :frowning:

The corresponding error from the API is:

Failure of External Task 2f848000-4c50-11ec-8265-c2407b9df59e cannot be reported by worker 'app-ccb9d4fcc-r844be3d4cfc7-92bb-40c6-a2a5-f0db32614abc'. It is locked by worker 'app-ccb9d4fcc-ftdd698336fb0-b90e-4150-adac-af86cac89d78'

It seems that workers are processing tasks that are already locked by another worker?

Hi @jschei,

No, external service tasks are wait states and are written to the database once they are reached.

After having a look at the implementation of FetchExternalTasksCmd in the engine (camunda-bpm-platform/FetchExternalTasksCmd.java at master · camunda/camunda-bpm-platform · GitHub), it seems to me that you have too many workers competing for tasks. You can ignore the message, because in the meantime another worker has locked the task and is working on it.

Hope this helps, Ingo

Hi Ingo,

thanks for looking into this.

Yes, this is also my feeling, because it only happens under higher load.

I am a little confused about this, though. Of course I could just ignore the message, but when the above error occurs on complete(), the task work (e.g. sending an email) has already been done. Meanwhile, the worker that originally locked the task also works on it, meaning that for an email task the email would be sent twice. And from the logs I can see that this in fact happens.

My understanding was that with the lock mechanism Camunda prevents exactly this. If not, what can I do to avoid tasks being processed twice? Can I check the lock status and stop the execution of the task beforehand?
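(Regarding the "check the lock status beforehand" idea: the Java client's ExternalTask exposes getLockExpirationTime(), so a handler could refuse to start a non-idempotent side effect once its lock has expired or is about to expire. A minimal, framework-free sketch of such a guard; the safety margin is an assumption:)

```java
import java.util.Date;

public class LockGuard {

    /**
     * Returns true if the lock is still held with at least safetyMarginMs
     * to spare, i.e. it is still reasonably safe to perform a
     * non-idempotent side effect such as sending an email.
     */
    public static boolean lockStillHeld(Date lockExpiration, long safetyMarginMs) {
        if (lockExpiration == null) {
            return false; // no lock information: play it safe
        }
        return lockExpiration.getTime() - System.currentTimeMillis() > safetyMarginMs;
    }
}
```

In the handler this would be called with externalTask.getLockExpirationTime() right before sending the email. Note that such a check only shrinks the race window; it does not eliminate it.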

Hi @jschei,

Could you please share a complete stack trace where the message is shown?

Or provide more context when and where the messages appear?

Thank you, Ingo

Hi @Ingo_Richtsmeier ,

as the log is full of parallel request logs, I extracted the most important lines for you and anonymised some data due to company-internal details.

As you can see in the first 4 lines, two different workers do a fetchAndLock and receive the same task. Worker A then does some work, completes the task, and receives the error that it is locked (line 16). It then runs into a catch block to report the failure, which also fails due to the lock (line 24). As there is no try-catch inside that catch block, the NotAcquiredException is thrown up to the process engine. Meanwhile, worker B processes and completes the locked task successfully.

log.txt (8.6 KB)

I have also added anonymised, pseudocode-like source of the ExternalTaskHandler implementation.

ExternalTaskHandler.java.txt (1.5 KB)

Don’t hesitate to request more info, if needed.

Thanks !

I think I may have found the issue (by reading other forum articles over and over again :slight_smile:). The lock time was set to 7 s. So, for the attached log: the first worker locked the task and worked on it, but took more than 7 s to complete. Another worker came in, locked the task, and completed it successfully. The first worker then tried to complete, but had lost the lock and therefore failed. Is that plausible?
Not yet confirmed, of course; I am still testing right now.

Hi @jschei,

if the lock time is too short, the engine assumes that the worker died and another worker may fetch the task.

A short lock time seems to be a good explanation for the effects you saw.

You can also extend the lock time while your worker handles the task: External Task Client | docs.camunda.org

Hope this helps, Ingo
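(For reference, the extension Ingo links to is ExternalTaskService#extendLock. A minimal sketch of a handler that tops up its lock before a potentially slow step, assuming the camunda-external-task-client Java API; the 60 s value and the sendMail step are made-up examples:)

```java
import org.camunda.bpm.client.task.ExternalTask;
import org.camunda.bpm.client.task.ExternalTaskHandler;
import org.camunda.bpm.client.task.ExternalTaskService;

public class MailTaskHandler implements ExternalTaskHandler {

    @Override
    public void execute(ExternalTask externalTask, ExternalTaskService externalTaskService) {
        // Before a step that may exceed the subscription's lockDuration,
        // renew the lock for another 60 seconds (an assumed value).
        externalTaskService.extendLock(externalTask, 60_000L);

        sendMail(externalTask); // hypothetical slow, non-idempotent step

        externalTaskService.complete(externalTask);
    }

    private void sendMail(ExternalTask task) {
        // ...
    }
}
```

For long-running work, extendLock can also be called periodically from the handler so the lock never lapses while the worker is still alive.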

Hi @Ingo_Richtsmeier,

It seems to work now! The errors mentioned above are gone completely. I can now see other errors from the database that may be related to the long-running task work, but that is an application-specific issue, I guess.

Thanks for your help !

If one cannot say how much time a task may take (sometimes 2 s, sometimes 2 min under high load, etc.), how can I be sure that a task is never processed twice? Should I set the lock time to a very high value, or is that not recommended?