If the job is not completed or failed within the configured job activation timeout, Zeebe reassigns the job to another job worker. This does not affect the number of remaining retries.
A timeout may lead to two different workers working on the same job, possibly at the same time. If this occurs, only one worker successfully completes the job. The other complete job command is rejected with a NOT FOUND error.
The fact that jobs may be worked on more than once means that Zeebe is an “at least once” system with respect to job delivery and that worker code must be idempotent. In other words, workers must deal with jobs in a way that allows the code to be executed more than once for the same job, all while preserving the expected application state.
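For illustration, one common way to make a handler idempotent is to deduplicate on the job key, so a redelivered job becomes a no-op. A minimal Python sketch (the `IdempotentHandler` name and handler shape are my own, not part of any Zeebe client API; in production the `processed` set would be a durable store shared by all workers):

```python
class IdempotentHandler:
    """Wraps a job handler so re-processing the same job key is a no-op."""

    def __init__(self, handler):
        self.handler = handler
        # In production this must be a durable store (DB, Redis, ...)
        # shared by all worker instances, not an in-memory set.
        self.processed = set()

    def handle(self, job_key, payload):
        if job_key in self.processed:
            # The job was already worked on, e.g. it was reassigned
            # after an activation timeout. Skip the business logic.
            return None
        result = self.handler(job_key, payload)
        # Mark as processed only after the logic succeeded, preserving
        # at-least-once semantics if the worker crashes mid-execution.
        self.processed.add(job_key)
        return result
```

Note this only gives exactly-once *business-logic* execution when the store update and the side effect commit together; otherwise a crash between the two still allows a re-run.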
As the docs show, when a job is not completed or failed within the configured job activation timeout, it is automatically reassigned and retried, which means our business logic may execute many times. In our case, it is hard to guarantee the idempotency of every job.

So, is there any way to disable the automatic retry when a job times out?
First of all, you should measure how long your task takes at most and adjust the timeout value of your request accordingly.
If this value varies too much or is too high, you could implement worker-side state that tracks the executed jobs. My assumption here is that your workers are scalable. A new worker picking up a job would then be able to check whether the task is still in progress.
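The worker-side tracking could look roughly like this: a shared store records a heartbeat per job key, and a worker that receives a reassigned job claims it only if the previous worker's heartbeat has gone stale. A Python sketch under those assumptions (`JobTracker` and its methods are hypothetical; in practice the dict would be backed by something like Redis or a database reachable by all workers):

```python
import time

class JobTracker:
    """Worker-side state shared between scaled worker instances.

    Lets a worker that picks up a reassigned job check whether the
    original worker is still actively working on it, based on a
    per-job heartbeat timestamp."""

    def __init__(self, heartbeat_timeout=30.0):
        self.heartbeat_timeout = heartbeat_timeout
        self.in_progress = {}  # job_key -> last heartbeat timestamp

    def try_claim(self, job_key, now=None):
        """Return True if this worker may start the job."""
        now = now if now is not None else time.monotonic()
        last = self.in_progress.get(job_key)
        if last is not None and now - last < self.heartbeat_timeout:
            return False  # another worker is still alive on this job
        self.in_progress[job_key] = now
        return True

    def heartbeat(self, job_key):
        """Called periodically by the active worker while processing."""
        self.in_progress[job_key] = time.monotonic()

    def finish(self, job_key):
        """Clear the entry once the job is completed or failed."""
        self.in_progress.pop(job_key, None)
```

The `now` parameter is only there to make the timing logic testable; real callers would rely on the default clock.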
The auto retry cannot be disabled. The reason is that a worker could crash during execution, leaving a task locked forever. The timeout prevents this.
You could achieve this by letting a task fail without retries explicitly in your worker code before the task times out. It is possible, but please also test this behaviour under load. It could lead to more administrative overhead.
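The pattern above could be sketched like this: the worker computes its own deadline slightly inside the activation timeout, and if the business logic cannot finish in time it fails the job with zero retries, raising an incident instead of triggering an automatic reassignment. A Python sketch under stated assumptions (`client` here is a hypothetical stand-in with `complete()`/`fail()` methods, not the real Zeebe client API; `DeadlineExceeded` is my own exception type):

```python
import time

class DeadlineExceeded(Exception):
    """Raised by business logic when it cannot finish before the deadline."""

def handle_job(job, client, business_logic, activation_timeout, safety_margin=5.0):
    """Fail a job with zero retries before the broker-side timeout fires.

    `client` is a hypothetical object with complete(key, result) and
    fail(key, retries, message) methods standing in for a Zeebe client."""
    deadline = time.monotonic() + activation_timeout - safety_margin
    try:
        # The business logic is expected to check `deadline` between its
        # steps and raise DeadlineExceeded when it cannot finish in time.
        result = business_logic(job, deadline)
        client.complete(job["key"], result)
    except DeadlineExceeded:
        # retries=0 turns the job into an incident for manual resolution
        # instead of letting the broker silently reassign and retry it.
        client.fail(job["key"], retries=0, message="worker deadline exceeded")
```

The `safety_margin` keeps the worker-side deadline comfortably inside the broker-side activation timeout, so the fail command reaches the broker before the job is reassigned.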
letting a task fail without retries explicitly in your worker code before the task times out
That seems helpful. But it means I need to write that code in every job worker. I hope the broker could fail the job directly on timeout, or make this behaviour configurable.
Maybe I should open an issue to see whether others have run into the same case?