Can anyone point me in the direction of a lightweight retry mechanism or solution for retrying failed tasks? Or provide some advice for handling such things?
For some added context: a use case my organisation faces quite often is retrying failed tasks. Typically we use one of the following approaches, depending on the business needs of the given workflow.
- Don’t retry. Just fail the workflow (i.e. let the user/initiator sort out retries if needed)
- Manage retries downstream (typically for anything asynchronous; this usually still ends in a failed task eventually)
However, more and more often we see the need for a lightweight retry mechanism for things like intermittent network outages, components restarting etc., i.e. things that are resolved in a matter of seconds (occasionally minutes) and where we don’t want the added complexity of queues or similar.
We had hoped that the built-in retry of failed tasks would fit the bill, but since the retry is immediate, with no option to specify a wait as far as I can tell, it is of limited value. We could obviously build something based on the job timeout (i.e. simply not respond back to Zeebe when something fails, and then increase the timeout for each attempt). Unless someone has a better solution, we might roll our own client wrapper doing this for us (we basically only need to persist the number of attempts for a given task instance in our apps) and then fail the workflow when we have reached max attempts.
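To make the wrapper idea a bit more concrete, here is a rough sketch of the bookkeeping I have in mind. It is only an illustration: `RetryTracker`, its fields and its default values are hypothetical names I made up, the in-memory dict stands in for whatever persistence we would actually use, and it deliberately makes no Zeebe client calls (how the computed timeout gets applied to the job is left open).

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RetryTracker:
    """Hypothetical helper: counts attempts per task instance and
    computes a growing timeout for each subsequent attempt."""
    max_attempts: int = 5          # assumed limit before failing the workflow
    base_timeout_s: float = 2.0    # assumed timeout for the first attempt
    factor: float = 2.0            # assumed growth factor per failed attempt
    max_timeout_s: float = 300.0   # assumed upper bound on the timeout
    # In-memory stand-in for persisted attempt counts, keyed by task instance.
    _attempts: Dict[str, int] = field(default_factory=dict)

    def record_failure(self, task_key: str) -> int:
        """Register one failed attempt and return the new attempt count."""
        count = self._attempts.get(task_key, 0) + 1
        self._attempts[task_key] = count
        return count

    def should_fail_workflow(self, task_key: str) -> bool:
        """True once the task has failed max_attempts times."""
        return self._attempts.get(task_key, 0) >= self.max_attempts

    def next_timeout_s(self, task_key: str) -> float:
        """Timeout for the next attempt: base * factor^failures, capped."""
        failures = self._attempts.get(task_key, 0)
        return min(self.base_timeout_s * self.factor ** failures,
                   self.max_timeout_s)

    def clear(self, task_key: str) -> None:
        """Forget a task once it has completed or been failed for good."""
        self._attempts.pop(task_key, None)
```

The worker would call `record_failure` when a task fails, then either stay silent and let the job time out (with the next attempt using `next_timeout_s`) or fail the workflow once `should_fail_workflow` returns true.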