Zeebe cluster mode is deployed, but high availability is not achieved

I deployed a Zeebe cluster following docker-compose.yml at master · camunda-community-hub/zeebe-docker-compose · GitHub. The cluster's topology request returns:

{
  "partitionsCount": 1,
  "replicationFactor": 2,
  "brokers": [
    {
      "partitions": [
        {
          "leader": true,
          "role": "LEADER",
          "partitionId": 1,
          "health": "HEALTHY"
        }
      ],
      "address": "10.0.1.114:26501",
      "port": 26501,
      "host": "10.0.1.114",
      "nodeId": 1,
      "version": "1.0.0"
    },
    {
      "partitions": [],
      "address": "10.0.1.115:26501",
      "port": 26501,
      "host": "10.0.1.115",
      "nodeId": 2,
      "version": "1.0.0"
    },
    {
      "partitions": [
        {
          "leader": false,
          "role": "FOLLOWER",
          "partitionId": 1,
          "health": "HEALTHY"
        }
      ],
      "address": "10.0.1.116:26501",
      "port": 26501,
      "host": "10.0.1.116",
      "nodeId": 0,
      "version": "1.0.0"
    }
  ],
  "gatewayVersion": "1.0.0",
  "clusterSize": 3
}

When the broker with nodeId 1 (the leader) is shut down, the cluster becomes unavailable.
How can the cluster be configured so that it keeps working no matter whether a leader or a follower goes down?

By having three replicas. With a replication factor of only two, the partition drops below quorum as soon as it loses one node.
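As a sketch, the relevant settings are the cluster environment variables in the docker-compose file; the service name node0 and the hostnames here are illustrative, and node1/node2 would differ only in their node ID:

```yaml
# One of three broker services (node1, node2 analogous).
node0:
  image: camunda/zeebe:1.0.0
  environment:
    - ZEEBE_BROKER_CLUSTER_NODEID=0
    - ZEEBE_BROKER_CLUSTER_CLUSTERSIZE=3
    - ZEEBE_BROKER_CLUSTER_PARTITIONSCOUNT=1
    # Replicate every partition onto all three brokers so a partition
    # still has a quorum (2 of 3 replicas) when any single node dies.
    - ZEEBE_BROKER_CLUSTER_REPLICATIONFACTOR=3
    - ZEEBE_BROKER_CLUSTER_INITIALCONTACTPOINTS=node0:26502,node1:26502,node2:26502
```

With replicationFactor=3, losing one broker leaves two replicas per partition, which is still a majority, so a new leader can be elected.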

Josh

{
  "partitionsCount": 3,
  "replicationFactor": 3,
  "brokers": [
    {
      "partitions": [
        {
          "leader": true,
          "role": "LEADER",
          "partitionId": 1,
          "health": "HEALTHY"
        }
      ],
      "address": "10.0.1.129:26501",
      "port": 26501,
      "host": "10.0.1.129",
      "nodeId": 0,
      "version": "1.0.0"
    },
    {
      "partitions": [
        {
          "leader": false,
          "role": "FOLLOWER",
          "partitionId": 1,
          "health": "HEALTHY"
        },
        {
          "leader": false,
          "role": "FOLLOWER",
          "partitionId": 2,
          "health": "HEALTHY"
        }
      ],
      "address": "10.0.1.130:26501",
      "port": 26501,
      "host": "10.0.1.130",
      "nodeId": 1,
      "version": "1.0.0"
    },
    {
      "partitions": [
        {
          "leader": true,
          "role": "LEADER",
          "partitionId": 3,
          "health": "HEALTHY"
        }
      ],
      "address": "10.0.1.133:26501",
      "port": 26501,
      "host": "10.0.1.133",
      "nodeId": 4,
      "version": "1.0.0"
    },
    {
      "partitions": [
        {
          "leader": false,
          "role": "FOLLOWER",
          "partitionId": 2,
          "health": "HEALTHY"
        },
        {
          "leader": false,
          "role": "FOLLOWER",
          "partitionId": 3,
          "health": "HEALTHY"
        }
      ],
      "address": "10.0.1.132:26501",
      "port": 26501,
      "host": "10.0.1.132",
      "nodeId": 3,
      "version": "1.0.0"
    }
  ],
  "gatewayVersion": "1.0.0",
  "clusterSize": 5
}

If I shut down one of the leader brokers, node 0, the client can no longer connect to the Zeebe cluster at all, and an exception occurs: ExecutionException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception. How do I build a Zeebe cluster so that the service can still be accessed normally when one of the leaders goes down?

Put a reverse proxy in front of it, with the three brokers as the upstream services. This will handle the failover to a working broker when one is down.

These brokers are running with the embedded gateway. You could also create a dedicated gateway, but then you still have a single point of failure.
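As a sketch of the reverse-proxy approach with Nginx: the upstream addresses below are the brokers from the topology above, assuming their embedded gateways listen on the default gateway port 26500 (the 26501 port in the topology is the internal command API). Nginx then balances the client's gRPC traffic across the gateways and fails over when one is down:

```nginx
# Balance Zeebe gRPC traffic across the embedded gateways.
upstream zeebe_gateways {
    server 10.0.1.129:26500;   # node 0
    server 10.0.1.130:26500;   # node 1
    server 10.0.1.133:26500;   # node 4
}

server {
    listen 26500 http2;        # clients connect here instead of to a broker
    location / {
        grpc_pass grpc://zeebe_gateways;
    }
}
```

The client then uses the proxy's address as its single gateway endpoint, and a broker outage no longer takes the endpoint with it.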

Josh

Do you mean that whether the client targets a broker or a gateway, it is still pointed at a single node, and that is what makes high availability impossible?

Your client has to speak to a single, specific endpoint. If you make that one of the brokers, then if this broker fails, the cluster may still be operational, but your client cannot speak to it.

So you could add a standalone gateway for the cluster, but you still have a single point of failure.

If you put a reverse proxy in front of the cluster with multiple upstream services defined, then the proxy will handle the failover if one of the nodes falls over. The reverse proxy is the single point of failure, but then you could put round-robin DNS in front of that.

Camunda SaaS does it this way using Nginx.

Josh

As you note, this just shifts the single point of failure one step up. So why is that better than using round-robin DNS towards multiple Zeebe gateways directly? That seems to make sense only if we assume Nginx is more robust than the Zeebe gateway.
