Any way to disable auto retry when a job times out?

Timeouts

If the job is not completed or failed within the configured job activation timeout, Zeebe reassigns the job to another job worker. This does not affect the number of remaining retries.

A timeout may lead to two different workers working on the same job, possibly at the same time. If this occurs, only one worker successfully completes the job. The other complete job command is rejected with a NOT FOUND error.

The fact that jobs may be worked on more than once means that Zeebe is an “at least once” system with respect to job delivery and that worker code must be idempotent. In other words, workers must deal with jobs in a way that allows the code to be executed more than once for the same job, all while preserving the expected application state.

As the docs show, when a job is not completed or failed within the configured job activation timeout, it is automatically reassigned and executed again, which means our business logic may run multiple times. In our case, it's hard to guarantee that every job is idempotent.

So is there any way to disable auto retry when a job times out?

Hello @TangJiong ,

first of all, you should measure the maximum time your task can take and adjust the timeout value of your request accordingly.

If this value varies too much or is too high, you could implement worker-side state that tracks the executed jobs. My assumption here is that your workers are scalable. A new worker picking up a job would then be able to check whether the task is still in progress.
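A minimal sketch of such worker-side tracking, in Python for brevity. The names `JobTracker` and `handle_once` are hypothetical and not part of any Zeebe client. The state here lives in one process only; with several worker instances it would have to sit in a shared store (a database, for example), which this sketch does not show.

```python
import threading
from enum import Enum


class JobState(Enum):
    IN_PROGRESS = "in_progress"
    DONE = "done"


class JobTracker:
    """In-memory record of the jobs this worker process has picked up."""

    def __init__(self):
        self._lock = threading.Lock()
        self._states = {}

    def try_claim(self, job_key):
        """Return True only for the first caller that claims job_key."""
        with self._lock:
            if job_key in self._states:
                return False
            self._states[job_key] = JobState.IN_PROGRESS
            return True

    def mark_done(self, job_key):
        with self._lock:
            self._states[job_key] = JobState.DONE

    def state(self, job_key):
        """Return the recorded state, or None if the job was never claimed."""
        with self._lock:
            return self._states.get(job_key)


def handle_once(tracker, job_key, business_logic):
    """Run business_logic only if job_key has not been claimed before."""
    if not tracker.try_claim(job_key):
        # A previous activation already started or finished this job.
        return False
    business_logic()
    tracker.mark_done(job_key)
    return True
```

If `business_logic` raises, the job stays marked as in progress. Whether to release the claim in that case depends on how safe a second execution would be.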

The auto retry cannot be disabled. The reason is that a worker could crash during execution, leaving a task locked forever. The timeout prevents this.

I hope this helps

Jonathan

@jonathanlukas Thanks, I understand this.

Instead of retrying automatically, the broker could mark the timed-out job as failed and raise an incident, leaving the decision to the developer. Would this be a better way?

Hello @TangJiong ,

you could achieve this by explicitly failing the task with zero retries in your worker code before the task times out. This is possible, but please also test this behaviour under load. It could lead to more administrative overhead.
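A rough sketch of that idea, in Python for brevity. `complete_job` and `fail_job` are hypothetical callbacks standing in for the real client commands, and `deadline_seconds` is an assumed value that you would choose to be shorter than the job activation timeout.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def run_with_deadline(job_key, handler, deadline_seconds,
                      complete_job, fail_job, executor):
    """Run handler; if it overruns the deadline, fail the job with zero retries.

    complete_job and fail_job are hypothetical stand-ins for client calls.
    deadline_seconds should be shorter than the job activation timeout so
    the fail command reaches the broker before the job is reassigned.
    """
    future = executor.submit(handler)
    try:
        result = future.result(timeout=deadline_seconds)
    except FutureTimeout:
        # The handler thread keeps running; Python cannot kill it. This only
        # reports the job as failed with no retries left.
        fail_job(job_key, retries=0, message="worker deadline exceeded")
        return "failed"
    complete_job(job_key, result)
    return "completed"
```

Note that the business logic may still finish in the background after the job has been failed, so this limits re-execution by another worker but does not cancel the work already started.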

Jonathan

@jonathanlukas Very grateful for your patience.

letting a task fail without retries explicitly in your worker code before the task times out

That seems to help. But it means I need to write this code in every job worker. I would prefer that the broker fail the job directly on timeout, or that this behaviour be configurable.

Maybe I can open an issue to see whether others have run into the same case?

Hello @TangJiong ,

yes, right now this has to be implemented explicitly on your side.

You are welcome to open an issue on this if you want.

Jonathan

For these kinds of things, I would wrap the client with my own facade, and put common logic in the facade.
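One possible shape for such a facade, sketched in Python. `WorkerFacade`, `dispatch`, and the `client` object with its `complete_job` and `fail_job` methods are all hypothetical; a real client library will have its own registration and command API.

```python
class WorkerFacade:
    """Wraps a job client so every handler shares the same failure policy."""

    def __init__(self, client):
        # client is a hypothetical object exposing complete_job and fail_job.
        self._client = client
        self._handlers = {}

    def worker(self, job_type):
        """Decorator that registers a handler for job_type with common logic."""
        def decorator(handler):
            def guarded(job_key, variables):
                try:
                    result = handler(variables)
                except Exception as exc:
                    # Common policy: no automatic retries, surface the error.
                    self._client.fail_job(job_key, retries=0, message=str(exc))
                    return
                self._client.complete_job(job_key, result)
            self._handlers[job_type] = guarded
            return handler
        return decorator

    def dispatch(self, job_type, job_key, variables):
        """Route an activated job to its guarded handler."""
        self._handlers[job_type](job_key, variables)
```

Handlers then contain only business logic, and any shared policy (such as failing with zero retries before the timeout) is written once inside `guarded`.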
