We have deployed two Zeebe clusters in the same Kubernetes (K8s) namespace, each consisting of three broker nodes. We have run into the following issue:
Assume the IPs of the three nodes in Cluster A are: IP-A-1, IP-A-2, IP-A-3
The IPs of the three nodes in Cluster B are: IP-B-1, IP-B-2, IP-B-3
After redeploying both Clusters A and B, because Kubernetes reuses pod IPs, some nodes in Cluster A came back with IPs that previously belonged to Cluster B nodes, and vice versa. For example, the new IPs of the three nodes in each cluster might be:
Cluster A: IP-B-1, IP-A-4, IP-A-5
Cluster B: IP-A-2, IP-B-4, IP-B-5
After the restart, we observed that both clusters attempted to communicate with nodes in the other cluster: Cluster A tried to reach its former member address IP-A-2, which now belongs to a Cluster B node, while Cluster B tried to reach IP-B-1, which now belongs to a Cluster A node. As a result, some partitions in both clusters remained unhealthy.
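For context, we suspect the cross-talk happens because membership is tracked by pod IP, which is not stable across redeploys. A workaround we are considering is advertising stable StatefulSet DNS names (via a headless Service) instead of IPs, and giving each cluster a distinct cluster name so brokers reject gossip from the other cluster. The sketch below uses the standard Zeebe broker environment variables; the Service, namespace, and broker names are illustrative assumptions, not our actual manifests:

```shell
# Illustrative environment for Cluster A, broker 0 (names are assumptions).

# Advertise a stable StatefulSet DNS identity instead of the (reusable) pod IP.
export ZEEBE_BROKER_NETWORK_ADVERTISEDHOST="zeebe-a-0.zeebe-a-headless.my-namespace.svc.cluster.local"

# Seed the member list with DNS names too, so a restart never resolves to a
# recycled IP now owned by the other cluster (26502 is the default internal port).
export ZEEBE_BROKER_CLUSTER_INITIALCONTACTPOINTS="zeebe-a-0.zeebe-a-headless.my-namespace.svc.cluster.local:26502,zeebe-a-1.zeebe-a-headless.my-namespace.svc.cluster.local:26502,zeebe-a-2.zeebe-a-headless.my-namespace.svc.cluster.local:26502"

# Distinct cluster names keep the two clusters from accepting each other's members.
export ZEEBE_BROKER_CLUSTER_CLUSTERNAME="cluster-a"
```

This is only a sketch of what we think the configuration should look like; we would appreciate confirmation of whether this actually prevents the cross-cluster communication.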
We then shut down Cluster B and restarted Cluster A. At this point the membership of Cluster A returned to normal, but it continuously logged warnings such as:
[2025-09-15 19:19:47.993] [zb-actors-0] [{actor-name=HealthCheckService, actor-scheduler=Broker-2}] WARN
io.camunda.zeebe.broker.system - Partition-1 failed, marking it as unhealthy: HealthReport[componentName=Partition-1, status=UNHEALTHY, issue=HealthIssue[message=Transition from FOLLOWER on term 360 appears blocked, step StreamProcessor has been running for PT23M27.417S, throwable=null, cause=null, since=2025-09-15T11:19:47.992Z], children={ZeebePartitionHealth-1=HealthReport[componentName=ZeebePartitionHealth-1, status=UNHEALTHY, issue=HealthIssue[message=Transition from FOLLOWER on term 360 appears blocked, step StreamProcessor has been running for PT23M27.417S, throwable=null, cause=null, since=2025-09-15T11:19:47.992Z], children={}], RaftPartition-1=HealthReport[componentName=RaftPartition-1, status=HEALTHY, issue=null, children={}]}]
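While this warning repeats, we have been checking the gateway topology to see which broker is leader for Partition-1 and whether the blocked FOLLOWER ever catches up; a diagnostic sketch using zbctl (the gateway address is an assumption for our environment):

```shell
# Query the cluster topology through the gateway (address/port are assumptions).
# Output lists, per broker, which partitions it hosts and its role/health.
zbctl status --address zeebe-a-gateway.my-namespace:26500 --insecure
```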
How should we recover the cluster in this situation?
Zeebe Version: 8.7.6