Zeebe worker circuit breakers

Are there any circuit breakers available out of the box in Zeebe? If a worker is generating fail commands above a specified threshold rate, I would like the jobs handled by that worker to be paused for a specified time and resumed later. Is it possible to achieve this in Zeebe?

If the fail command is sent with the same retry count every time, the job will cycle forever. Do not pass that parameter, so that the default value is applied (3 attempts, if I remember right) and an incident is raised afterwards.

This is an interesting idea. This might happen if a particular worker instance has lost its connectivity to a dependent API. In that case, you might want that worker to detect its own uselessness, and throttle or circuit break.

An issue with this is that your system will not be intelligible. If all workers for a task type circuit break, rather than the dependent API surfacing as the problem via incidents, you end up with processes simply halting, and timing out…

There is nothing like this in any of the client libraries at the moment.

You could code this behaviour into a worker pretty easily. You create a sliding window, and do two things: wrap the entire worker handler to capture any unhandled exceptions, and wrap the failure completion method. In your decorator you update the sliding window, then execute the normal failure completion.
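A minimal sketch of that decorator, written in plain Python without any Zeebe client so it can run on its own. The names `SlidingWindow`, `with_failure_tracking` and `fail_job` are hypothetical; `fail_job(job, message)` stands in for whatever failure completion call your client library provides.

```python
from collections import deque
from functools import wraps


class SlidingWindow:
    """Remembers the outcome of the last `size` jobs handled by this worker."""

    def __init__(self, size=20):
        self.outcomes = deque(maxlen=size)  # True = failed, False = succeeded

    def record(self, failed):
        self.outcomes.append(failed)

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)


def with_failure_tracking(window, fail_job):
    """Build a decorator for job handlers of the form handler(job, fail).

    `fail_job(job, message)` is a placeholder for the client's real
    failure completion method (an assumption, not a Zeebe API).
    """

    def decorator(handler):
        @wraps(handler)
        def wrapped(job):
            failed = False

            def fail(message):
                # Update the sliding window, then run the normal failure completion.
                nonlocal failed
                failed = True
                window.record(True)
                return fail_job(job, message)

            try:
                handler(job, fail)
            except Exception as exc:
                # An unhandled exception counts as a failure, unless the
                # handler already failed the job before raising.
                if not failed:
                    fail(str(exc))
                return
            if not failed:
                window.record(False)

        return wrapped

    return decorator
```

The handler receives the wrapped `fail` function instead of calling the client directly, so both explicit failures and unhandled exceptions land in the same window.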

Hi MaximMonin, thank you for your response. I do specify the retry count in the fail command, which creates an incident once the retries are exhausted.

Consider the following scenario:

My Zeebe worker calls a downstream service that is unavailable for a few minutes due to a transient issue. The workers calling the downstream system will fail, and after all the retries are exhausted an incident is created. In a high-volume system, I would not want to create thousands of incidents and then retry them manually. Instead, I would like to embed a circuit breaker so that the workers pause processing for a preconfigured time and resume later.

Thank you for your response.

My thought is: “a specific Zeebe worker (even when multiple instances of that worker are running on different nodes) will not accept any new jobs for a configurable time limit when it is producing incidents beyond the threshold rate.”

Could you please explain more about the behavior you suggested? Since a Zeebe worker is stateless and has knowledge only of the unit of work/job currently being processed, how can the failure rate be determined for a specified worker type?

Workers are stateless with respect to the processes in Zeebe, but they are not stateless with respect to their own connections to other systems. That is how exponential backoff / retry can be implemented.
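As a rough illustration of that kind of worker-local state, here is a small exponential backoff sketch. `BackoffState` and `call_with_backoff` are hypothetical names, the delays are arbitrary defaults, and `call` stands in for the request to the dependent API.

```python
import random
import time


class BackoffState:
    """Worker-local memory of how the dependent API has been behaving."""

    def __init__(self, base=0.5, cap=30.0):
        self.base = base  # delay in seconds after the first failure
        self.cap = cap    # upper bound on any single delay
        self.consecutive_failures = 0

    def next_delay(self):
        # Doubles with every consecutive failure, capped at `cap` seconds.
        return min(self.cap, self.base * 2 ** self.consecutive_failures)

    def record(self, ok):
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1


def call_with_backoff(call, state, attempts=4, sleep=time.sleep, jitter=random.random):
    """Try `call` up to `attempts` times (at least 1), waiting longer after each failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            result = call()
        except Exception as exc:
            last_error = exc
            delay = state.next_delay()
            state.record(ok=False)
            if attempt < attempts - 1:
                # Jitter scales the delay to between 50% and 100% of its value.
                sleep(delay * (0.5 + 0.5 * jitter()))
        else:
            state.record(ok=True)
            return result
    raise last_error
```

Because `state` outlives any single job, a worker that keeps one instance per dependent API carries what it learned from one job into the next, even though it holds no state about the process instances themselves.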

You asked:

Since a Zeebe worker is stateless and has knowledge only of the unit of work/job currently being processed, how can the failure rate be determined for a specified worker type?

A Zeebe worker does have knowledge of every job that it processes itself, so it can keep that history in a sliding window and use it to determine its own failure rate.
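One way this could look, sketched in plain Python: a time-based window that pauses the worker for a configurable period once the failure rate crosses a threshold. Everything here (`WorkerCircuitBreaker`, its parameters and defaults) is hypothetical, and the sketch assumes your polling loop checks `may_accept_jobs()` before activating more jobs. How you actually stop polling depends on the client library.

```python
import time
from collections import deque


class WorkerCircuitBreaker:
    """Tracks this worker instance's recent job outcomes and pauses it when too many fail."""

    def __init__(self, threshold=0.5, window_seconds=60.0, min_jobs=10,
                 pause_seconds=120.0, clock=time.monotonic):
        self.threshold = threshold            # failure rate that trips the breaker
        self.window_seconds = window_seconds  # how far back the window looks
        self.min_jobs = min_jobs              # do not trip on a tiny sample
        self.pause_seconds = pause_seconds    # how long to stop accepting jobs
        self.clock = clock
        self.events = deque()                 # (timestamp, failed) pairs
        self.paused_until = float("-inf")

    def _trim(self, now):
        # Drop outcomes that have fallen out of the time window.
        while self.events and now - self.events[0][0] > self.window_seconds:
            self.events.popleft()

    def record(self, failed):
        now = self.clock()
        self.events.append((now, failed))
        self._trim(now)
        failures = sum(1 for _, was_failure in self.events if was_failure)
        total = len(self.events)
        if total >= self.min_jobs and failures / total >= self.threshold:
            self.paused_until = now + self.pause_seconds
            self.events.clear()  # start from a clean window after the pause

    def may_accept_jobs(self):
        return self.clock() >= self.paused_until
```

Note that this state lives in a single worker instance. Coordinating a pause across several instances of the same worker type on different nodes would need some shared store, which is outside what this sketch covers.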