Issue connecting to the gateway

I have created a cluster with 4 broker nodes, 4 partitions, and a replication factor of 2.

The Java Zeebe client is able to connect to the gateway if every partition has a leader node.

If I bring broker-1 down, partitions 1 and 4 become unhealthy, but partitions 2 and 3 stay healthy.
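
For reference, this is roughly how we check which partitions still have a leader, using the plain Java client; the gateway address matches the localhost:26500 from the exception below, and the rest is a minimal sketch rather than our exact code.

import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.Topology;

public final class TopologyCheck {
  public static void main(String[] args) {
    // Connect to the local gateway without TLS.
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500")
        .usePlaintext()
        .build()) {
      Topology topology = client.newTopologyRequest().send().join();
      // Print the role each broker has for each partition (LEADER / FOLLOWER).
      topology.getBrokers().forEach(broker ->
          broker.getPartitions().forEach(partition ->
              System.out.printf("broker %d -> partition %d: %s%n",
                  broker.getNodeId(), partition.getPartitionId(), partition.getRole())));
    }
  }
}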


But if we try to reconnect the Java Zeebe client to the gateway, we get the exception below.

org.springframework.context.ApplicationContextException: Failed to start bean ‘zeebeClientLifecycle’; nested exception is io.camunda.zeebe.client.api.command.ClientStatusException: deadline exceeded after 9.959935277s. [closed=[], open=[[buffered_nanos=234615363, remote_addr=localhost/127.0.0.1:26500]]]
at org.springframework.context.support.DefaultLifecycleProcessor.doStart(DefaultLifecycleProcessor.java:181) ~[spring-context-5.3.23.jar:5.3.23]
at org.springframework.context.support.DefaultLifecycleProcessor.access$200(DefaultLifecycleProcessor.java:54) ~[spring-context-5.3.23.jar:5.3.23]
at org.springframework.context.support.DefaultLifecycleProcessor$LifecycleGroup.start(DefaultLifecycleProcessor.java:356) ~[spring-context-5.3.23.jar:5.3.23]
at java.lang.Iterable.forEach(Iterable.java:75) ~[?:?]
at org.springframework.context.support.DefaultLifecycleProcessor.startBeans(DefaultLifecycleProcessor.java:155) ~[spring-context-5.3.23.jar:5.3.23]
at org.springframework.context.support.DefaultLifecycleProcessor.onRefresh(DefaultLifecycleProcessor.java:123) ~[spring-context-5.3.23.jar:5.3.23]
at org.springframework.context.support.AbstractApplicationContext.finishRefresh(AbstractApplicationContext.java:935) ~[spring-context-5.3.23.jar:5.3.23]
at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:586) ~[spring-context-5.3.23.jar:5.3.23]
at org.springframework.boot.SpringApplication.refresh(SpringApplication.java:734) ~[spring-boot-2.7.5.jar:2.7.5]
at org.springframework.boot.SpringApplication.refreshContext(SpringApplication.java:408) ~[spring-boot-2.7.5.jar:2.7.5]
at org.springframework.boot.SpringApplication.run(SpringApplication.java:308) ~[spring-boot-2.7.5.jar:2.7.5]
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1306) ~[spring-boot-2.7.5.jar:2.7.5]
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1295) ~[spring-boot-2.7.5.jar:2.7.5]
at com.helloworld.Application.main(Application.java:57) ~[classes/:?]
Caused by: io.camunda.zeebe.client.api.command.ClientStatusException: deadline exceeded after 9.959935277s. [closed=[], open=[[buffered_nanos=234615363, remote_addr=localhost/127.0.0.1:26500]]]
at io.camunda.zeebe.client.impl.ZeebeClientFutureImpl.transformExecutionException(ZeebeClientFutureImpl.java:93) ~[zeebe-client-java-8.1.6.jar:8.1.6]
at io.camunda.zeebe.client.impl.ZeebeClientFutureImpl.join(ZeebeClientFutureImpl.java:50) ~[zeebe-client-java-8.1.6.jar:8.1.6]
at io.camunda.zeebe.spring.client.annotation.processor.ZeebeDeploymentAnnotationProcessor.lambda$start$7(ZeebeDeploymentAnnotationProcessor.java:119) ~[spring-zeebe-8.1.9.jar:8.1.9]
at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?]
at io.camunda.zeebe.spring.client.annotation.processor.ZeebeDeploymentAnnotationProcessor.start(ZeebeDeploymentAnnotationProcessor.java:100) ~[spring-zeebe-8.1.9.jar:8.1.9]
at io.camunda.zeebe.spring.client.annotation.processor.ZeebeAnnotationProcessorRegistry.lambda$startAll$0(ZeebeAnnotationProcessorRegistry.java:38) ~[spring-zeebe-8.1.9.jar:8.1.9]
at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?]
at io.camunda.zeebe.spring.client.annotation.processor.ZeebeAnnotationProcessorRegistry.startAll(ZeebeAnnotationProcessorRegistry.java:38) ~[spring-zeebe-8.1.9.jar:8.1.9]
at io.camunda.zeebe.spring.client.lifecycle.ZeebeClientLifecycle.start(ZeebeClientLifecycle.java:49) ~[spring-zeebe-8.1.9.jar:8.1.9]
at org.springframework.context.support.DefaultLifecycleProcessor.doStart(DefaultLifecycleProcessor.java:178) ~[spring-context-5.3.23.jar:5.3.23]
… 13 more
Caused by: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 9.959935277s. [closed=[], open=[[buffered_nanos=234615363, remote_addr=localhost/127.0.0.1:26500]]]
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999) ~[?:?]
at io.camunda.zeebe.client.impl.ZeebeClientFutureImpl.join(ZeebeClientFutureImpl.java:48) ~[zeebe-client-java-8.1.6.jar:8.1.6]
at io.camunda.zeebe.spring.client.annotation.processor.ZeebeDeploymentAnnotationProcessor.lambda$start$7(ZeebeDeploymentAnnotationProcessor.java:119) ~[spring-zeebe-8.1.9.jar:8.1.9]
at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?]
at io.camunda.zeebe.spring.client.annotation.processor.ZeebeDeploymentAnnotationProcessor.start(ZeebeDeploymentAnnotationProcessor.java:100) ~[spring-zeebe-8.1.9.jar:8.1.9]
at io.camunda.zeebe.spring.client.annotation.processor.ZeebeAnnotationProcessorRegistry.lambda$startAll$0(ZeebeAnnotationProcessorRegistry.java:38) ~[spring-zeebe-8.1.9.jar:8.1.9]
at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?]
at io.camunda.zeebe.spring.client.annotation.processor.ZeebeAnnotationProcessorRegistry.startAll(ZeebeAnnotationProcessorRegistry.java:38) ~[spring-zeebe-8.1.9.jar:8.1.9]
at io.camunda.zeebe.spring.client.lifecycle.ZeebeClientLifecycle.start(ZeebeClientLifecycle.java:49) ~[spring-zeebe-8.1.9.jar:8.1.9]
at org.springframework.context.support.DefaultLifecycleProcessor.doStart(DefaultLifecycleProcessor.java:178) ~[spring-context-5.3.23.jar:5.3.23]
… 13 more
Caused by: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 9.959935277s. [closed=[], open=[[buffered_nanos=234615363, remote_addr=localhost/127.0.0.1:26500]]]
at io.grpc.Status.asRuntimeException(Status.java:535) ~[grpc-api-1.49.1.jar:1.49.1]
at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487) ~[grpc-stub-1.49.1.jar:1.49.1]
at io.grpc.internal.DelayedClientCall$DelayedListener$3.run(DelayedClientCall.java:470) ~[grpc-core-1.49.1.jar:1.49.1]
at io.grpc.internal.DelayedClientCall$DelayedListener.delayOrExecute(DelayedClientCall.java:434) ~[grpc-core-1.49.1.jar:1.49.1]
at io.grpc.internal.DelayedClientCall$DelayedListener.onClose(DelayedClientCall.java:467) ~[grpc-core-1.49.1.jar:1.49.1]
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:563) ~[grpc-core-1.49.1.jar:1.49.1]
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70) ~[grpc-core-1.49.1.jar:1.49.1]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:744) ~[grpc-core-1.49.1.jar:1.49.1]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723) ~[grpc-core-1.49.1.jar:1.49.1]
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.49.1.jar:1.49.1]
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) ~[grpc-core-1.49.1.jar:1.49.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
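
For context, the failing startup step is the Spring Zeebe deployment annotation processor (ZeebeDeploymentAnnotationProcessor in the trace above). A minimal application that triggers that step looks roughly like the sketch below; the annotation style and the BPMN resource name are examples, not our exact code.

import io.camunda.zeebe.spring.client.EnableZeebeClient;
import io.camunda.zeebe.spring.client.annotation.Deployment;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
@EnableZeebeClient
// Deploys the process model during startup; this deployment is what blocks and
// eventually times out when the target partition has no leader.
@Deployment(resources = "classpath:demo-process.bpmn") // example resource name
public class Application {
  public static void main(String[] args) {
    SpringApplication.run(Application.class, args);
  }
}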

Hey @Manian_Manoharan

That is expected. You have configured a replication factor of 2.

This means you have the following partition distribution.

$ ./partitionDistribution.sh 4 4 2
Distribution:
P\N|	N 0|	N 1|	N 2|	N 3
P 1|	L  |	F  |	-  |	-  
P 2|	-  |	L  |	F  |	-  
P 3|	-  |	-  |	L  |	F  
P 4|	F  |	-  |	-  |	L  

Partitions per Node:
N 0: 2
N 1: 2
N 2: 2
N 3: 2

I guess you mean you brought Broker-0 down, since it is part of partitions one and four.

With a replication factor of 2 you can't reach quorum anymore, which makes those partitions unhealthy and unusable at that moment.

As a recommendation: try to use an odd replication factor, since even numbers don't give you any additional benefit (see also this SO post). So use, for example, three as the replication factor.

Hope that helps.

Other posts where I mentioned this:

Greets
Chris

@Zelldon ,
We set the replication factor to 2 to test that the cluster keeps working even if only a single partition is healthy. As per the above testing, the cluster works only if all partitions are healthy.

Is there any configuration to make the Zeebe cluster work even if only a single partition is healthy?

Hey @Manian_Manoharan

Not sure whether I can follow you. You want to test whether Zeebe works even if one partition is unhealthy but the others are not. Is that it?

Greets
Chris

BTW, it sounds like you're interested in chaos experiments; maybe zbchaos (Release Zbchaos v1.0.0 · zeebe-io/zeebe-chaos · GitHub) is an interesting tool for you. It makes it easier to run chaos experiments against Zeebe clusters deployed in K8s.

You mentioned "But if we try to reconnect the Java Zeebe client to the gateway, we get the exception below" and then posted an exception which shows deadline exceeded after 9.959935277s.

What does reconnect mean here? Do you deploy a process model and wait for the response?
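
If it is the deployment on startup: under the hood that is essentially a blocking call like the sketch below (gateway address and resource name are placeholders), and the join() is what eventually fails with DEADLINE_EXCEEDED when the addressed partition has no leader.

import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.DeploymentEvent;

public final class DeployOnStartup {
  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500")
        .usePlaintext()
        .build()) {
      // Deploy a process model and block until the gateway answers.
      DeploymentEvent deployment = client
          .newDeployResourceCommand()
          .addResourceFromClasspath("demo-process.bpmn") // placeholder resource name
          .send()
          .join(); // fails with DEADLINE_EXCEEDED if the request cannot be completed in time
      System.out.println("Deployed with key " + deployment.getKey());
    }
  }
}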

Greets
Chris

Reconnect means restarting our Java services.

We are testing whether the cluster keeps working as long as at least one partition is healthy.

Your partition will only be available if >50% of the replicas of that partition are available.
If you set replication to 2 and then remove one of the systems providing a replica, you now have exactly 50% available. Since you do not have >50% available, that partition is not available.

If you then try to start a process (I notice, like @Zelldon did, that your error says “Spring-Zeebe”, which indicates that you're trying to use a client to access your cluster, not trying to restart the Zeebe node that you simulated failing), it will try to access the partition that is not available and will time out.

Try setting the replication factor to 3, and repeat your tests where you remove one broker.

At all times, you must have (RoundDown(Replication Factor / 2) + 1) brokers available for each partition. For Replication:2 this works out to:
(RoundDown(Replication Factor / 2) + 1)
(RoundDown(2 / 2) + 1)
(RoundDown(1) + 1)
(1 + 1)
2

For Replication:3 this works out to:
(RoundDown(Replication Factor / 2) + 1)
(RoundDown(3 / 2) + 1)
(RoundDown(1.5) + 1)
(1 + 1)
2
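
The same rule as a tiny Java helper, if that is easier to read than the arithmetic:

public final class Quorum {
  // Minimum number of replicas that must be available for a partition to keep a quorum.
  static int quorum(int replicationFactor) {
    return replicationFactor / 2 + 1; // integer division already rounds down
  }

  public static void main(String[] args) {
    System.out.println(quorum(2)); // 2 -> losing 1 of 2 replicas makes the partition unavailable
    System.out.println(quorum(3)); // 2 -> 1 of 3 replicas may fail and the partition stays available
  }
}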

Thanks @GotnOGuts :+1:

I added a warning to the startup process (Add warning for even replication factor by Zelldon · Pull Request #11831 · camunda/zeebe · GitHub).

Hope this helps for future users.

Greets
Chris
