Issue with jobs activation during partition rebalancing

Hi,
While running tests against my Zeebe cluster (version 8.4), I found unexpected behavior. I have a test cluster with 3 nodes, each leading 3 partitions, and a JobWorker that pulls jobs from the Zeebe gateway and sends requests to a mock app. When partition health degrades on a single host because of disk or network issues (I reproduced this with docker.pause()), I expect only about 1/3 of the traffic to the mock app to be affected. Instead, all traffic is affected until rebalancing finishes. My assumption is that the JobWorker tries to pull jobs from all partitions (until the max-jobs-active setting is reached) and hangs on the unhealthy partitions because of the high request timeout from gateway to broker (15 sec).
This timeout is controlled by the zeebe.gateway.cluster.requestTimeout setting. When I changed it to 1 sec, the traffic issue disappeared. However, I'm not sure this is the right fix, because lowering this setting could have negative side effects. Hopefully this is a known issue and someone can suggest an optimal configuration or a different way to deal with it.
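
To make my assumption concrete, here is a rough back-of-the-envelope model in Python. It is not Zeebe code: the function name, the 10 ms latency for a healthy partition, and the idea that partitions are asked strictly one after another are all assumptions of mine for illustration. It only shows how a single activation pass could stall if every unhealthy partition blocks until the request timeout expires.

```python
def activation_round_seconds(partition_healthy, request_timeout_s, healthy_latency_s=0.01):
    """Estimate the duration of one sequential pass over all partitions.

    Assumed model (hypothetical, not taken from Zeebe source):
    - a healthy partition answers after healthy_latency_s
    - an unhealthy partition blocks until request_timeout_s expires
    - partitions are queried one after another, never in parallel
    """
    return sum(
        healthy_latency_s if healthy else request_timeout_s
        for healthy in partition_healthy
    )


# 3 nodes x 3 leader partitions = 9 partitions; the paused node leads 3 of them.
partitions = [True] * 6 + [False] * 3

for timeout in (15.0, 1.0):
    pass_time = activation_round_seconds(partitions, timeout)
    print(f"requestTimeout={timeout:>4}s -> one pass takes about {pass_time:.2f}s")
```

Under these assumptions a pass takes roughly 45 seconds with the 15 sec timeout and roughly 3 seconds with 1 sec, which would match what I see: jobs on the healthy partitions are not lost, but the worker spends most of its time waiting on the unhealthy ones.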
Thank you
