We closed and rejoined a broker in a 1-gateway-4-broker cluster to test Zeebe's fault tolerance.
When the cluster is configured with 48 partitions, the closed broker fails to rejoin the cluster. The error logs are as follows:
2020-07-15 12:19:53.660 [] [raft-server-3-raft-partition-partition-22] WARN io.atomix.raft.roles.LeaderAppender - RaftServer{raft-partition-partition-22} - AppendRequest{term=2, leader=3, prevLogIndex=2, prevLogTerm=2, entries=0, commitIndex=2} to 2 failed: java.util.concurrent.CompletionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /x.x.x.x:26502
java.util.concurrent.CompletionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /x.x.x.x:26502
at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$executeOnPooledConnection$19(NettyMessagingService.java:486) ~[atomix-cluster-0.23.3.jar:0.23.3]
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) ~[guava-28.2-jre.jar:?]
at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$executeOnPooledConnection$20(NettyMessagingService.java:486) ~[atomix-cluster-0.23.3.jar:0.23.3]
at java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
at io.atomix.cluster.messaging.impl.ChannelPool.lambda$getChannel$4(ChannelPool.java:142) ~[atomix-cluster-0.23.3.jar:0.23.3]
at java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
at io.atomix.utils.concurrent.OrderedFuture.complete(OrderedFuture.java:368) ~[atomix-utils-0.23.3.jar:0.23.3]
at java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$bootstrapClient$31(NettyMessagingService.java:615) ~[atomix-cluster-0.23.3.jar:0.23.3]
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) ~[netty-common-4.1.50.Final.jar:4.1.50.Final]
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) ~[netty-common-4.1.50.Final.jar:4.1.50.Final]
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) ~[netty-common-4.1.50.Final.jar:4.1.50.Final]
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) ~[netty-common-4.1.50.Final.jar:4.1.50.Final]
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) ~[netty-common-4.1.50.Final.jar:4.1.50.Final]
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) ~[netty-common-4.1.50.Final.jar:4.1.50.Final]
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[netty-common-4.1.50.Final.jar:4.1.50.Final]
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:636) ~[netty-transport-native-epoll-4.1.50.Final-linux-x86_64.jar:4.1.50.Final]
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:655) ~[netty-transport-native-epoll-4.1.50.Final-linux-x86_64.jar:4.1.50.Final]
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:529) ~[netty-transport-native-epoll-4.1.50.Final-linux-x86_64.jar:4.1.50.Final]
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465) ~[netty-transport-native-epoll-4.1.50.Final-linux-x86_64.jar:4.1.50.Final]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) ~[netty-transport-native-epoll-4.1.50.Final-linux-x86_64.jar:4.1.50.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[netty-common-4.1.50.Final.jar:4.1.50.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.50.Final.jar:4.1.50.Final]
at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /9.139.128.176:26502
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
at io.netty.channel.unix.Errors.throwConnectException(Errors.java:124) ~[netty-transport-native-unix-common-4.1.50.Final.jar:4.1.50.Final]
at io.netty.channel.unix.Socket.finishConnect(Socket.java:243) ~[netty-transport-native-unix-common-4.1.50.Final.jar:4.1.50.Final]
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:672) ~[netty-transport-native-epoll-4.1.50.Final-linux-x86_64.jar:4.1.50.Final]
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:649) ~[netty-transport-native-epoll-4.1.50.Final-linux-x86_64.jar:4.1.50.Final]
... 6 more
With 12 partitions, however, the closed broker rejoins the cluster successfully within a few minutes.
Does the partition count affect the time the cluster needs to recover from a node failure?
If so, what is a reasonable partition count when setting up a cluster? Any suggestions?
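For context, our own rough estimate (assuming Zeebe's round-robin partition distribution): with replication factor 3 across 4 brokers, each broker hosts about 48 × 3 / 4 = 36 partition replicas in the 48-partition setup, versus 12 × 3 / 4 = 9 replicas in the 12-partition setup, so the rejoining broker has four times as many Raft partitions to reinstall and catch up on.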
More info:
Broker configuration:
- version: 0.23.3
- 16 CPU cores / 16 GB memory / 500 GB SSD per pod
- replication factor: 3
Gateway configuration:
- version: 0.23.3
- 8 CPU cores / 8 GB memory
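For reference, a minimal sketch of how these settings map onto the broker's application.yaml in the failing 48-partition case. The property names are the standard Zeebe 0.23 cluster settings; the contact-point hostnames are placeholders for our actual pod addresses:

```yaml
zeebe:
  broker:
    cluster:
      clusterSize: 4            # 4 brokers in the test cluster
      partitionsCount: 48       # 48 in the failing case, 12 in the working case
      replicationFactor: 3
      initialContactPoints:     # placeholder hostnames, internal API port 26502
        - zeebe-broker-0:26502
        - zeebe-broker-1:26502
        - zeebe-broker-2:26502
```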