Hi,
We are running Zeebe 0.22.5 as a 3-broker cluster in our production Kubernetes environment.
The brokers' memory consumption keeps increasing, so we have to proactively reboot them on a regular basis before they run out of memory (OOM).
But the last time we rebooted the brokers, one broker (broker-1) could not rejoin the cluster afterwards.
The cluster status we get is shown below:
"
Cluster size: 3
Partitions count: 3
Replication factor: 3
Brokers:
Broker 0 - zeebe-cluster-zeebe-0.zeebe-cluster-zeebe.zeebe-cluster.svc.local:26501
Partition 1 : Leader
Partition 2 : Leader
Partition 3 : Leader
Broker 2 - zeebe-cluster-zeebe-2.zeebe-cluster-zeebe.zeebe-cluster.svc.local:26501
Partition 1 : Follower
Partition 2 : Follower
Partition 3 : Follower
"
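To make the gap explicit, here is a small sketch that parses the status text above and lists the broker IDs that are absent. It is not part of Zeebe: `missing_brokers` is a hypothetical helper, and it assumes broker IDs run from 0 to cluster size minus 1 and that the status output has exactly the layout shown above.

```python
import re

# Status text copied from the output above.
STATUS = """\
Cluster size: 3
Partitions count: 3
Replication factor: 3
Brokers:
Broker 0 - zeebe-cluster-zeebe-0.zeebe-cluster-zeebe.zeebe-cluster.svc.local:26501
Partition 1 : Leader
Partition 2 : Leader
Partition 3 : Leader
Broker 2 - zeebe-cluster-zeebe-2.zeebe-cluster-zeebe.zeebe-cluster.svc.local:26501
Partition 1 : Follower
Partition 2 : Follower
Partition 3 : Follower
"""


def missing_brokers(status):
    """Return the sorted broker IDs that are expected but not listed.

    Assumption: broker IDs are 0 .. cluster size - 1.
    """
    size_match = re.search(r"^Cluster size:\s*(\d+)", status, re.MULTILINE)
    if size_match is None:
        raise ValueError("no 'Cluster size' line found")
    expected = set(range(int(size_match.group(1))))
    # "Broker <id> - <address>" lines; the "Brokers:" header does not match.
    present = {int(m) for m in re.findall(r"^Broker (\d+) - ", status, re.MULTILINE)}
    return sorted(expected - present)


if __name__ == "__main__":
    print(missing_brokers(STATUS))
```

For the status above this reports broker 1 as the only missing member.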
Broker-0 and broker-2 only log the following messages periodically (about every 15 minutes):
" RaftServer{raft-partition-partition-2} - Failed to install 1"
" RaftServer{raft-partition-partition-3} - Failed to install 1"
We also tried cleaning up broker-1’s data directory and rebooting it, since we suspected its copy of the data might be corrupted and expected the data to be re-created from the other brokers during startup. That did not help either.
Can someone hint at what the reason(s) might be, and is there a safe way to get broker-1 to rejoin the cluster?
Thanks,
Oliver