Zeebe 8.4.x fails to start with "Failed to handle message, host is not a known cluster member" error

We are upgrading Zeebe from 8.2.12 to 8.4.0 (8.4.5 fails as well), but the Zeebe brokers error out on startup with:

java.util.concurrent.CompletionException: io.atomix.cluster.messaging.MessagingException$RemoteHandlerFailure: Remote handler failed to handle message, cause: Failed to handle message, host dev-zeebe-0.dev-zeebe.default.svc:26502 is not a known cluster member

The Helm chart version for 8.4.0 is 9.0.2; starting 8.4.5 produced the same error. We also tried the latest release (8.5.0-alpha2) locally on a laptop with the helm command below and saw the same issue in the logs of Zeebe broker 0:

helm install dev camunda/camunda-platform --set identity.enabled=false --set optimize.enabled=false --set tasklist.enabled=false --set operate.enabled=false --set connectors.enabled=false --set zeebe.affinity.podAntiAffinity=null --set zeebe-gateway.affinity.podAntiAffinity=null --set global.identity.auth.enabled=false
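
If it helps to double-check which chart version ships which application version, Helm can list the mapping (the repo URL below is the official Camunda Helm repository):

# add/refresh the Camunda Helm repository
helm repo add camunda https://helm.camunda.io
helm repo update
# list chart versions together with the application versions they ship
helm search repo camunda/camunda-platform --versions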

The installation fails both on our AWS setup with EC2 instances and locally on a laptop.

I am probably missing something in the configuration. Any insights into what could be going wrong?

The Helm values file is:

global:
  identity:
    auth:
      enabled: false
  image:
    tag: 8.4.0

identity:
  enabled: false

optimize:
  enabled: false

tasklist:
  enabled: false

operate:
  enabled: false

elasticsearch:
  enabled: true
  image:
    repository: bitnami/elasticsearch
    tag: 8.3.2
  master:
    replicaCount: 1
    resources:
      requests:
        cpu: 1
        memory: 2Gi
      limits:
        cpu: 1
        memory: 2Gi

connectors:
  enabled: false

zeebe:
  clusterSize: 3
  partitionCount: 3
  replicationFactor: 1
  cpuThreadCount: 4
  ioThreadCount: 4
  logLevel: info
  retention:
    enabled: true
    minimumAge: 10d
  affinity:
    podAntiAffinity: null
  env:
    - name: ZEEBE_BROKER_EXECUTION_METRICS_EXPORTER_ENABLED
      value: "true"
  pvcSize: 128Gi
  resources:
    requests:
      cpu: 1
      memory: 512Mi
    limits:
      cpu: 1
      memory: 512Mi

zeebe-gateway:
  replicas: 2
  affinity:
    podAntiAffinity: null
  env:
    - name: ZEEBE_GATEWAY_THREADS_MANAGEMENTTHREADS
      value: "4"
    - name: ZEEBE_GATEWAY_MONITORING_ENABLED
      value: "true"
  resources:
    requests:
      cpu: 1
      memory: 512Mi
    limits:
      cpu: 1
      memory: 512Mi
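
For completeness, this values file is applied with a command equivalent to the --set flags above (values.yaml is just a placeholder file name; the release name and chart version are the ones mentioned earlier):

helm install dev camunda/camunda-platform --version 9.0.2 -f values.yaml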

The complete stack trace is:

2024-03-28 08:36:56.153 [] [atomix-cluster-heartbeat-sender] [] INFO 
      io.atomix.cluster.protocol.swim - 0 - Member added Member{id=2, address=dev-zeebe-2.dev-zeebe.default.svc:26502, properties={}}
2024-03-28 08:36:56.184 [Broker-0] [zb-actors-1] [] WARN 
      io.camunda.zeebe.topology.gossip.ClusterTopologyGossiper - Failed to sync with 2
java.util.concurrent.CompletionException: io.atomix.cluster.messaging.MessagingException$RemoteHandlerFailure: Remote handler failed to handle message, cause: Failed to handle message, host dev-zeebe-0.dev-zeebe.default.svc:26502 is not a known cluster member
        at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
        at java.base/java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
        at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source) ~[?:?]
        at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
        at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
        at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$executeOnPooledConnection$25(NettyMessagingService.java:626) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31) ~[guava-33.0.0-jre.jar:?]
        at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$executeOnPooledConnection$26(NettyMessagingService.java:624) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]
        at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) ~[?:?]
        at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) ~[?:?]
        at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
        at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
        at io.atomix.cluster.messaging.impl.AbstractClientConnection.dispatch(AbstractClientConnection.java:49) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]
        at io.atomix.cluster.messaging.impl.AbstractClientConnection.dispatch(AbstractClientConnection.java:30) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]
        at io.atomix.cluster.messaging.impl.NettyMessagingService$MessageDispatcher.channelRead0(NettyMessagingService.java:1109) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) ~[netty-codec-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318) ~[netty-codec-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800) ~[netty-transport-classes-epoll-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:509) ~[netty-transport-classes-epoll-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:407) ~[netty-transport-classes-epoll-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.104.Final.jar:4.1.104.Final]
        at java.base/java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: io.atomix.cluster.messaging.MessagingException$RemoteHandlerFailure: Remote handler failed to handle message, cause: Failed to handle message, host dev-zeebe-0.dev-zeebe.default.svc:26502 is not a known cluster member
        ... 22 more

I am getting exactly the same error on version 8.4.5. What I can see in the Zeebe log is that the members are added, and then the "is not a known cluster member" error kicks in:

 io.atomix.cluster.impl.DefaultClusterMembershipService - Started cluster membership service for member Member{id=0, address=camunda-zeebe-0.camunda-zeebe.camunda.svc:26502, properties={}}
      io.atomix.cluster.protocol.swim - 0 - Member added Member{id=1, address=camunda-zeebe-1.camunda-zeebe.camunda.svc:26502, properties={}}
      io.atomix.cluster.protocol.swim - 0 - Member added Member{id=2, address=camunda-zeebe-2.camunda-zeebe.camunda.svc:26502, properties={}}
java.util.concurrent.CompletionException: io.atomix.cluster.messaging.MessagingException$RemoteHandlerFailure: Remote handler failed to handle message, cause: Failed to handle message, host camunda-zeebe-0.camunda-zeebe.camunda.svc:26502 is not a known cluster member

@hhajian - While troubleshooting, I realised that this is logged as a warning. Workflow executions work fine despite these warnings, though I am not sure whether there are any side effects.
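
One way to confirm that the cluster still forms correctly despite the warning is to inspect the gateway topology, e.g. by port-forwarding the gateway service and running zbctl status (the service name below assumes the "dev" release from the command above and may differ in your setup):

# in one terminal: forward the gateway gRPC port to localhost
kubectl port-forward svc/dev-zeebe-gateway 26500:26500
# in another terminal: print the cluster topology (brokers, partitions, roles, health)
zbctl status --address localhost:26500 --insecure

If the cluster has formed, this should list all three brokers with their partitions and roles.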

Do workflows run in your setup too?

Hi @jgeek1,
I have noticed the same: workflow executions work fine, and it is a warning message, not an error.
But I wanted to understand whether there will be any performance impact in the future.

We have been running some tests. For one of our sample workflows, we didn't observe any performance impact. With 2 gateways, 5 brokers, and 25 partitions we are able to achieve a throughput of 200 PI/s (process instances per second).

Have you done, or are you currently running, any performance tests @mv4701 @hhajian?