Zeebe Cluster Communication Issues Due to K8s IP Reuse

We have deployed two Zeebe clusters in the same Kubernetes namespace, each consisting of three broker nodes. We have now encountered an issue:

Assume the IPs of the three nodes in Cluster A are: IP-A-1, IP-A-2, IP-A-3
The IPs of the three nodes in Cluster B are: IP-B-1, IP-B-2, IP-B-3

After redeploying both Clusters A and B, Kubernetes reassigned pod IPs, and some nodes in Cluster A received IPs that previously belonged to nodes in Cluster B, and vice versa. In other words, the new IPs of the three nodes in Clusters A and B might be (for example):

Cluster A: IP-B-1, IP-A-4, IP-A-5
Cluster B: IP-A-2, IP-B-4, IP-B-5

At this point, we noticed that after the restart, both clusters attempted to communicate with nodes in the other cluster. For example, Cluster A tried to reach the node at IP-A-2 (now part of Cluster B), while Cluster B tried to reach the node at IP-B-1 (now part of Cluster A). As a result, some partitions in both clusters remained unhealthy.

Subsequently, we shut down Cluster B and restarted Cluster A. At this point, the members of Cluster A returned to normal, but it continuously reported warnings:

[2025-09-15 19:19:47.993] [zb-actors-0] [{actor-name=HealthCheckService, actor-scheduler=Broker-2}] WARN

io.camunda.zeebe.broker.system - Partition-1 failed, marking it as unhealthy: HealthReport[componentName=Partition-1, status=UNHEALTHY, issue=HealthIssue[message=Transition from FOLLOWER on term 360 appears blocked, step StreamProcessor has been running for PT23M27.417S, throwable=null, cause=null, since=2025-09-15T11:19:47.992Z], children={ZeebePartitionHealth-1=HealthReport[componentName=ZeebePartitionHealth-1, status=UNHEALTHY, issue=HealthIssue[message=Transition from FOLLOWER on term 360 appears blocked, step StreamProcessor has been running for PT23M27.417S, throwable=null, cause=null, since=2025-09-15T11:19:47.992Z], children={}], RaftPartition-1=HealthReport[componentName=RaftPartition-1, status=HEALTHY, issue=null, children={}]}]

How should we restore the cluster in this situation?

Zeebe Version: 8.7.6

Hi there! :wave:

This is a classic issue with Zeebe cluster isolation in Kubernetes when IP addresses get reused across different clusters. Let me help you understand what happened and how to resolve it.

Root Cause Analysis

The issue you’re experiencing is due to cross-cluster communication caused by:

  1. IP Reuse: Kubernetes reassigned IPs from one cluster to another
  2. Lack of Cluster Isolation: Both clusters likely have the same clusterName configuration
  3. Shared Contact Points: Brokers are trying to communicate with nodes from the other cluster based on previously cached topology information

Immediate Recovery Steps

1. Restart Affected Broker Pods

Since you’re seeing “Transition from FOLLOWER on term 360 appears blocked, step StreamProcessor has been running for PT23M27.417S”, the StreamProcessor is stuck.

Action: Restart the broker pod hosting the unhealthy partition to unblock the stuck StreamProcessor transition step.
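A minimal sketch of that restart, assuming a Helm-style StatefulSet deployment; the namespace, StatefulSet, and pod names here are placeholders, so adjust them to your environment (your log shows actor-scheduler=Broker-2, which suggests broker 2 hosts the stuck partition):

```shell
# Restart only the broker hosting the unhealthy partition
# (pod and namespace names are hypothetical):
kubectl -n zeebe delete pod cluster-a-broker-2

# Or restart the whole broker StatefulSet one pod at a time and wait for it:
kubectl -n zeebe rollout restart statefulset/cluster-a-broker
kubectl -n zeebe rollout status statefulset/cluster-a-broker
```

A rolling restart keeps quorum as long as only one broker is down at a time.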

2. Clear Persistent State (if needed)

If restarting doesn’t help and you see persistent “Services not installed” messages:

  • You may need to clear the partition data on affected follower nodes (never on the leader)
  • The follower will then re-sync from the leader’s snapshot
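The steps above could look roughly like this; treat it as a sketch, not a procedure — the pod name and data path are assumptions (the path matches the default in the official Zeebe Docker image), and you should confirm the node is a follower and back up its data first:

```shell
# 1. Confirm the affected node is a FOLLOWER for the partition, e.g. via `zbctl status`.
# 2. Back up the data volume before deleting anything.
# 3. Remove the partition data (path assumes the default Zeebe image layout):
kubectl -n zeebe exec cluster-a-broker-1 -- rm -rf /usr/local/zeebe/data
# 4. Restart the pod so the broker rejoins and re-replicates from the leader's snapshot:
kubectl -n zeebe delete pod cluster-a-broker-1
```

Never do this on the partition leader — the follower can rebuild its state from the leader, but not the other way around.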

Long-term Solution: Proper Cluster Isolation

To prevent this issue in the future, implement proper cluster isolation:

1. Use Different Cluster Names

Configure distinct clusterName for each Zeebe cluster:

# Cluster A
zeebe:
  broker:
    cluster:
      clusterName: "zeebe-cluster-a"
  gateway:
    cluster:
      clusterName: "zeebe-cluster-a"

# Cluster B  
zeebe:
  broker:
    cluster:
      clusterName: "zeebe-cluster-b"
  gateway:
    cluster:
      clusterName: "zeebe-cluster-b"

2. Separate Namespaces

Deploy each cluster in its own Kubernetes namespace for logical separation.

3. Configure Proper Contact Points

Ensure each cluster’s initialContactPoints only includes brokers from the same cluster:

# Cluster A brokers should only list Cluster A contact points
zeebe:
  broker:
    cluster:
      initialContactPoints:
        - cluster-a-broker-0.cluster-a-broker:26502
        - cluster-a-broker-1.cluster-a-broker:26502
        - cluster-a-broker-2.cluster-a-broker:26502

4. Network Policies (Optional)

Implement Kubernetes NetworkPolicies to restrict traffic between namespaces on ports 26500/26501/26502.
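As a sketch, a NetworkPolicy like the following would restrict the Zeebe ports to traffic from the same namespace; the namespace name and pod labels are assumptions you'd need to match to your deployment:

```yaml
# Hypothetical example: allow Zeebe ports only from pods in the same namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: zeebe-same-namespace-only
  namespace: zeebe-cluster-a   # adjust per cluster
spec:
  podSelector:
    matchLabels:
      app: zeebe-broker        # adjust to your broker pod labels
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}      # empty selector = any pod in this namespace
      ports:
        - protocol: TCP
          port: 26500
        - protocol: TCP
          port: 26501
        - protocol: TCP
          port: 26502
```

Note that NetworkPolicies only take effect if your cluster runs a CNI plugin that enforces them.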

Recovery Procedure

  1. Shut down Cluster B (as you’ve already done)
  2. Restart all Cluster A broker pods to clear any cached topology from Cluster B
  3. Wait for cluster to stabilize - give it time to re-establish quorum and catch up
  4. Verify health using the health endpoint
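For the verification step, Zeebe brokers expose a Spring actuator health endpoint on the monitoring port (9600 by default), and topology can be checked through the gateway with zbctl; the pod and namespace names below are placeholders:

```shell
# Broker health via the actuator endpoint (port-forward, then query):
kubectl -n zeebe port-forward pod/cluster-a-broker-0 9600:9600 &
curl -s http://localhost:9600/actuator/health

# Cluster topology (leaders/followers per partition) via the gateway:
zbctl --insecure status
```

All partitions should report a leader and healthy followers before you redeploy Cluster B.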
  5. Reconfigure both clusters with proper isolation before redeploying Cluster B

Prevention

  • Always use unique clusterName values
  • Deploy clusters in separate namespaces
  • Use proper service discovery within each cluster
  • Consider using StatefulSets with stable network identities
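On the last point, the stable identity comes from pairing the StatefulSet with a headless Service, so brokers address each other by DNS name rather than by reusable pod IPs. A minimal sketch (names are hypothetical; the official Helm chart sets this up for you):

```yaml
# Headless Service: gives each StatefulSet pod a stable DNS name, e.g.
# cluster-a-broker-0.cluster-a-broker.<namespace>.svc
apiVersion: v1
kind: Service
metadata:
  name: cluster-a-broker
spec:
  clusterIP: None          # headless: per-pod DNS records, no virtual IP
  selector:
    app: zeebe-broker
  ports:
    - name: internal
      port: 26502
```

With DNS-based contact points, an IP handed to a different cluster after redeployment no longer resolves to a peer.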

The key insight is that Zeebe clusters need proper logical separation, not just physical separation. The clusterName configuration is crucial for preventing cross-cluster communication.

Let me know if you need help with the specific configuration or if the issue persists after trying these steps!

Hey @jamesxql - I’m not sure whether what the bot provided was helpful, but questions like this are almost always best served by opening a support ticket. There is a lot of environment-specific information involved, and a lot of missing detail that you might not want to share publicly (Helm values.yaml, topologies, cloud provider and network configuration, etc.).

Thanks, how do I initiate a support ticket?

Great question! Here’s how you can open a support ticket:

For Camunda 8 SaaS Customers:

  • Log into your Camunda Console
  • Navigate to the Support section
  • Click “Create Support Ticket”
  • Provide details about your issue

For Camunda 8 Self-Managed Customers:

  • If you have a commercial license/support contract, you should have received support contact information
  • Use the support portal or email provided by your Camunda account team
  • Include your license/contract information when submitting

For Community Users:

  • Continue using this forum for community support
  • For complex deployment issues like yours, consider the GitHub repository if you suspect a bug

What to Include in Your Support Ticket:

  • Zeebe version: 8.7.6 (:white_check_mark: you already mentioned this)
  • Environment: Kubernetes details (version, cloud provider, etc.)
  • Deployment method: Helm charts, operators, etc.
  • Configuration files: Helm values.yaml (sanitized)
  • Logs: Full broker logs around the time of the issue
  • Network topology: How your clusters are deployed
  • Timeline: When the issue started and what triggered it

Since your issue involves complex networking, cluster topology, and potentially sensitive infrastructure details, a support ticket will allow you to share more detailed information securely and get personalized assistance from the Camunda support team.

If you’re unsure about your support entitlement, you can also reach out to your Camunda account representative or check your licensing agreement.

Hope this helps! :dart:

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.