One Zeebe broker cannot rejoin the cluster after reboot

Hi,

We are using Zeebe 0.22.5 and run a 3-broker cluster in our production Kubernetes environment.
The brokers' memory consumption keeps increasing, so we proactively reboot them on a regular basis before they go OOM.

But the last time we rebooted the brokers, one of them could not rejoin the cluster after the reboot.
The cluster status looked like this:
"
Cluster size: 3
Partitions count: 3
Replication factor: 3
Brokers:
Broker 0 - zeebe-cluster-zeebe-0.zeebe-cluster-zeebe.zeebe-cluster.svc.local:26501
Partition 1 : Leader
Partition 2 : Leader
Partition 3 : Leader
Broker 2 - zeebe-cluster-zeebe-2.zeebe-cluster-zeebe.zeebe-cluster.svc.local:26501
Partition 1 : Follower
Partition 2 : Follower
Partition 3 : Follower
"

Broker 0 and broker 2 only logged the lines below periodically (roughly every 15 minutes):

    RaftServer{raft-partition-partition-2} - Failed to install 1
    RaftServer{raft-partition-partition-3} - Failed to install 1

We also tried cleaning up broker 1's data directory and rebooting it, since we suspected its copy of the data might be corrupted and expected the data to be re-replicated from the other brokers during startup. No luck.

Can someone hint at what the reason(s) might be, and is there a safe approach to let broker 1 rejoin the cluster?

Thanks,
Oliver

Hey @oliver_zm

first of all welcome to the Camunda Cloud Community :tada:

A small counter question: why do you still use 0.22.5 (released Jul 7, 2020)? The latest version is 0.26.1, which I would recommend; we fixed several issues regarding long startup times etc.

It sounds to me like broker 1 is trying to get the latest snapshot from the other partitions, which might take a while depending on the resources it has and the size of the snapshot (check the snapshot files: how many do you have?). The easiest way to overcome this issue is to copy the data from the leader to this node.

Greets
Chris

Thanks @Zelldon for the advice.

We have been using 0.22.5 in our production system for quite a while, and a lot of active workflow instances are generated every day, so we are cautious about deciding on an upgrade path with minimal impact. The other reason is that we were hoping for a release that resolves the memory consumption issues, so we could do the upgrade in one shot.

Looking into upgrading to 0.26.1 as you suggested, is there any major area we should pay attention to? Correct me if I'm wrong: can we upgrade the Zeebe brokers directly from 0.22.5 to 0.26.1 without worrying about the existing data stored on disk? Should any graceful shutdown be performed before upgrading? I believe we also need to upgrade our applications to use the matching Zeebe client SDK version. Also, we manually bundled the Hazelcast exporter 0.7.1 jar into the broker; do we need to worry about exporter compatibility?

Thanks,
Oliver

Hey @oliver_zm

thanks for providing more details here.

It is not supported to migrate in one go; you should do it step by step. But I think there have also been some breaking changes in between, so for you it might make sense to wait until 1.0 is released, which is planned for Q2. 1.0 will also contain multiple breaking changes, so you would need to start from a fresh installation. What you could do is run a separate cluster where you start new instances, only allow the old instances to complete on the old cluster, and at some point turn the old cluster off completely.

If you migrate to Zeebe 1.0, you need to use clients that target 1.0. The same applies to the Hazelcast exporter.

Hope that helps.

Greets
Chris

Got it. Thanks @Zelldon !

Look forward to release 1.0.

BTW, I have a side question, @Zelldon.

We defined 3 partitions across 3 brokers for the cluster, but we never end up with a 1/1/1 leader distribution pattern; it is either 2/1/0 or 3/0/0. The leaders are the brokers most vulnerable to fast memory consumption. Is there a way to control this?

Thanks,
Oliver

Hey @oliver_zm

currently there is no way to change the leader distribution, but I think there are plans to improve it on a best-effort basis.

Greets
Chris

Try these parameters; I noticed that membership information is exchanged better with broadcast:

      membership:
        # Configure whether to broadcast member updates to all members.
        # If set to false updates will be gossiped among the members.
        # If set to true the network traffic may increase but it reduces the time to detect membership changes.
        # This setting can also be overridden using the environment variable ZEEBE_BROKER_CLUSTER_MEMBERSHIP_BROADCASTUPDATES
        broadcastUpdates: true

        # Configure whether to broadcast disputes to all members.
        # If set to true the network traffic may increase but it reduces the time to detect membership changes.
        # This setting can also be overridden using the environment variable ZEEBE_BROKER_CLUSTER_MEMBERSHIP_BROADCASTDISPUTES
        broadcastDisputes: true
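
If you prefer to set broker options through environment variables on the Kubernetes StatefulSet instead of a mounted config file, the same two flags can be applied roughly like this. A minimal sketch only: the StatefulSet and container names are placeholders for whatever your deployment uses; only the env var names come from the config comments above.

    # Sketch: enable the broadcast settings via env vars on the broker container.
    # "zeebe-cluster-zeebe" and "zeebe" are placeholder names, not values from this thread.
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: zeebe-cluster-zeebe
    spec:
      template:
        spec:
          containers:
            - name: zeebe
              env:
                - name: ZEEBE_BROKER_CLUSTER_MEMBERSHIP_BROADCASTUPDATES
                  value: "true"
                - name: ZEEBE_BROKER_CLUSTER_MEMBERSHIP_BROADCASTDISPUTES
                  value: "true"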

Thanks @MaximMonin. Will try.

I had issues with this release due to increased memory requirements on restart. When a broker needs to rebuild its state, it uses more memory than it does during normal operation, which can cause a repeated OOM scenario.

So allocating more memory to the failed broker might be the thing in your scenario.
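
In a Kubernetes setup like the one above, that typically means raising the memory request/limit on the broker container, along these lines (only a sketch; the container name and sizes are placeholders, not values from this thread, so size them from your actual snapshot/state size):

    # Hypothetical memory sizing for the broker container.
    containers:
      - name: zeebe
        resources:
          requests:
            memory: "4Gi"
          limits:
            memory: "8Gi"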

Indeed @jwulf, we have run into exactly the trouble you describe. Depending on the partition data size stored on disk at reboot, we have had to tune/increase the JVM heap size, or even increase the physical server's total memory, to get a successful reboot.

Now we keep monitoring the disk consumption as well as the memory consumption of the Zeebe brokers, so that we can trigger a reboot in time and avoid OOM at startup.
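
For reference, a heap override can be passed to the broker container as JVM flags, for example like the sketch below. This assumes your image's startup script reads JAVA_OPTS (as the standard Zeebe distribution scripts do; check what your image honors), and the sizes are only placeholders.

    # Sketch: raise the broker JVM heap via JAVA_OPTS.
    # Keep the heap well below the container memory limit so RocksDB
    # and other off-heap allocations still have headroom.
    env:
      - name: JAVA_OPTS
        value: "-Xms2g -Xmx4g"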