Hi,
While running tests against my Zeebe cluster (version 8.4), I found some unexpected behavior. I have a test cluster with 3 nodes, each leading 3 partitions, and a JobWorker that pulls jobs from the Zeebe gateway. The JobWorker sends requests to a mock app. When the partitions on just 1 host become unhealthy due to disk or network issues (I reproduced this with docker.pause()), I expect only 1/3 of the traffic to the mock app to be affected. However, all traffic is affected until rebalancing finishes. My assumption is that the JobWorker tries to pull jobs from all partitions (until the max-jobs-active setting is reached) and hangs on the unhealthy partitions because of the high request timeout from gateway to broker (15 sec).
This timeout is controlled by the zeebe.gateway.cluster.requestTimeout setting. When I changed it to 1 sec, the traffic issue disappeared. However, I'm not sure this is the right way to solve it, because decreasing this setting could have negative side effects. Hopefully this is a known issue and someone can suggest an optimal configuration or a different way to deal with it.
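To make my assumption concrete, here is a rough model in Java (plain arithmetic, no Zeebe client calls). It assumes the worst case where one activation round tries all 9 partitions one after another, healthy partitions answer in about 10 ms (a made-up number, not measured), and each of the 3 partitions on the paused host costs the full request timeout. PollRoundModel and roundMillis are names I made up for this sketch, and I don't know whether the gateway really works this way internally.

```java
// Back-of-the-envelope model only: it does not talk to Zeebe and is
// not a description of how the gateway is actually implemented.
final class PollRoundModel {

    // Worst-case duration of one activation round when partitions are tried
    // one after another: each healthy partition answers in healthyMillis,
    // each unhealthy partition costs the full request timeout.
    static long roundMillis(int healthy, int unhealthy, long healthyMillis, long timeoutMillis) {
        return healthy * healthyMillis + unhealthy * timeoutMillis;
    }

    public static void main(String[] args) {
        long healthyMillis = 10; // assumed latency, not measured

        // 3 nodes x 3 leader partitions = 9 partitions; pausing 1 node stalls 3 of them.
        System.out.println("all healthy:            " + roundMillis(9, 0, healthyMillis, 15_000) + " ms");
        System.out.println("1 node paused, 15s:     " + roundMillis(6, 3, healthyMillis, 15_000) + " ms");
        System.out.println("1 node paused, 1s:      " + roundMillis(6, 3, healthyMillis, 1_000) + " ms");
    }
}
```

If this model is roughly right, one round takes about 45 sec with the 15 sec timeout versus about 3 sec with the 1 sec timeout, which would explain why the worker looks completely stalled and not just 1/3 slower.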
Thank you