Zeebe worker circuit breakers

Ajay_1 · April 28, 2021, 1:48am

Are there any circuit breakers available out of the box in Zeebe? If a worker is generating fail commands above a specified threshold rate, I would like the jobs handled by that worker to be paused for a specified time and resumed later. Is it possible to achieve this in Zeebe?

MaximMonin · April 28, 2021, 4:53am

If the fail command is sent with an explicit retry count every time, the job will retry in an infinite cycle. Do not pass that parameter; the default value is then applied (3 attempts, if I remember right), and after that an incident is raised.

jwulf · April 28, 2021, 12:40pm

This is an interesting idea. This might happen if a particular worker instance has lost its connectivity to a dependent API. In that case, you might want that worker to detect its own uselessness, and throttle or circuit break.

An issue with this is that your system will not be intelligible. If all workers for a task type circuit break, rather than the dependent API surfacing as the problem via incidents, you end up with processes simply halting, and timing out…

There is nothing like this in any of the client libraries at the moment.

You could code this behaviour into a worker pretty easily. You create a sliding window, and do two things: wrap the entire worker handler to capture any unhandled exceptions, and wrap the failure completion method. In your decorator you update the sliding window, then execute the normal failure completion.
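A minimal sketch of that idea in Python, not tied to any particular Zeebe client library. The names `FailureWindow`, `track_unhandled` and `track_fail` are made up for illustration, and `handler` and `fail_job` are hypothetical stand-ins for whatever handler function and failure completion method your client exposes. The window here simply counts failures within the last `window_seconds`.

```python
import time
from collections import deque
from typing import Callable, Deque


class FailureWindow:
    """Sliding window of failure timestamps for one worker instance."""

    def __init__(self, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the window can be tested
        self._failures: Deque[float] = deque()

    def record_failure(self) -> None:
        self._failures.append(self._clock())

    def failure_count(self) -> int:
        """Number of failures seen within the last `window_seconds`."""
        cutoff = self._clock() - self.window_seconds
        # Drop timestamps that have slid out of the window.
        while self._failures and self._failures[0] < cutoff:
            self._failures.popleft()
        return len(self._failures)


def track_unhandled(window: FailureWindow, handler: Callable) -> Callable:
    """Wrap the entire job handler to capture unhandled exceptions."""
    def wrapped(job):
        try:
            return handler(job)
        except Exception:
            window.record_failure()
            raise  # re-raise so the error is still handled as usual
    return wrapped


def track_fail(window: FailureWindow, fail_job: Callable) -> Callable:
    """Decorate the failure completion method (hypothetical `fail_job`)."""
    def wrapped(*args, **kwargs):
        window.record_failure()            # update the sliding window
        return fail_job(*args, **kwargs)   # then the normal failure completion
    return wrapped
```

The worker would register `track_unhandled(window, handler)` as its handler and call the decorated failure method instead of the original one. What it does once `failure_count()` crosses a threshold is a separate decision.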

Ajay_1 · April 28, 2021, 3:42pm

Hi MaximMonin, thank you for your response. I do specify the retry count in the fail command, which creates an incident after the retries are exhausted.

Consider the following scenario:

My Zeebe worker calls a downstream service that is unavailable for a few minutes due to a transient issue. The workers calling the downstream system will fail, and after all the retries are exhausted, an incident is created. In a high-volume system, I would not want to create thousands of incidents and retry them manually. Instead, I would like to embed a circuit breaker so that the workers pause processing for a preconfigured time and resume later.

Ajay_1 · April 28, 2021, 4:01pm

Thank you for your response.

My thought is: “A specific Zeebe worker (even with multiple instances of the worker running on different nodes) will not accept any new jobs for a configurable time when that worker is producing incidents beyond the threshold rate.”

Could you please explain more about the behavior you suggested? Since a zeebe worker is stateless and it has knowledge only about the unit of work/job currently being processed, how to determine the failure rate for a specified worker type?

jwulf · April 28, 2021, 7:10pm

Workers are stateless with respect to the processes in Zeebe, but they are not stateless with respect to their own connections to other systems. That is how exponential backoff / retry can be implemented.
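To illustrate the kind of state meant here, a rough Python sketch of exponential backoff driven by a per-connection failure counter. `DownstreamConnection`, `backoff_delay` and the default values (`base`, `cap`) are assumptions made up for this example, not part of any Zeebe client.

```python
def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based), capped."""
    return min(cap, base * (2 ** attempt))


class DownstreamConnection:
    """State the worker keeps about its own connection to another system."""

    def __init__(self) -> None:
        self.consecutive_failures = 0

    def on_success(self) -> None:
        self.consecutive_failures = 0

    def on_failure(self) -> None:
        self.consecutive_failures += 1

    def next_delay(self) -> float:
        """How long to wait before calling the downstream system again."""
        if self.consecutive_failures == 0:
            return 0.0
        return backoff_delay(self.consecutive_failures - 1)
```

None of this state lives in Zeebe. It belongs to the worker process, and it survives from one job to the next.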

You asked:

Since a zeebe worker is stateless and it has knowledge only about the unit of work/job currently being processed, how to determine the failure rate for a specified worker type?

A Zeebe worker has knowledge only of the jobs that it processes itself, but it sees all of those, so it can maintain state in a sliding window to determine its own failure rate.
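Once a worker instance knows its own recent failure count, the pause itself can be sketched as a small gate that the worker checks before asking for more jobs. This is only an illustration under assumptions: `PauseGate`, `max_failures` and `pause_seconds` are invented names, and how you actually stop a given client library from activating jobs is library-specific and not shown.

```python
import time
from typing import Callable, Optional


class PauseGate:
    """Pauses one worker instance for a fixed time after too many failures."""

    def __init__(self, max_failures: int, pause_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.pause_seconds = pause_seconds
        self._clock = clock
        self._paused_until: Optional[float] = None

    def update(self, recent_failures: int) -> None:
        """Trip the gate when the windowed failure count exceeds the limit."""
        if recent_failures > self.max_failures and not self.is_paused():
            self._paused_until = self._clock() + self.pause_seconds

    def is_paused(self) -> bool:
        """True while the worker should not take on new jobs."""
        if self._paused_until is None:
            return False
        if self._clock() >= self._paused_until:
            self._paused_until = None  # pause elapsed, resume processing
            return False
        return True
```

Each instance makes this decision from its own failures only. If every instance for a task type pauses at once, jobs wait and may time out instead of raising incidents, which is the intelligibility trade-off mentioned earlier in the thread.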

system · January 31, 2024, 10:08am