Hi,
We are evaluating Zeebe for one of our use cases. To understand how Zeebe handles replication and leader role transitions when one or more brokers go down, we tried the configuration below.
Config (equivalent broker environment variables shown below):
- Cluster size: 5
- Partition count: 4
- Replication factor: 2
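For reference, these settings roughly correspond to the following broker environment variables in our StatefulSet (shown only as an illustration; the equivalent application.yaml keys would live under zeebe.broker.cluster):

ZEEBE_BROKER_CLUSTER_CLUSTERSIZE=5
ZEEBE_BROKER_CLUSTER_PARTITIONSCOUNT=4
ZEEBE_BROKER_CLUSTER_REPLICATIONFACTOR=2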
Below is the status from zbctl
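(We ran roughly the following; --insecure because TLS is not enabled in our test setup:)

zbctl status --insecure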
{
"brokers": [
{
"nodeId": 3,
"host": "zb-zeebe-3.zb-zeebe.default.svc.cluster.local",
"port": 26501,
"partitions": [
{
"partitionId": 4,
"role": "LEADER",
"health": "HEALTHY"
},
{
"partitionId": 3,
"role": "LEADER",
"health": "HEALTHY"
}
],
"version": "1.0.0"
},
{
"nodeId": 4,
"host": "zb-zeebe-4.zb-zeebe.default.svc.cluster.local",
"port": 26501,
"partitions": [
{
"partitionId": 4,
"role": "FOLLOWER",
"health": "HEALTHY"
}
],
"version": "1.0.0"
},
{
"nodeId": 0,
"host": "zb-zeebe-0.zb-zeebe.default.svc.cluster.local",
"port": 26501,
"partitions": [
{
"partitionId": 1,
"role": "LEADER",
"health": "HEALTHY"
}
],
"version": "1.0.0"
},
{
"nodeId": 2,
"host": "zb-zeebe-2.zb-zeebe.default.svc.cluster.local",
"port": 26501,
"partitions": [
{
"partitionId": 3,
"role": "FOLLOWER",
"health": "HEALTHY"
},
{
"partitionId": 2,
"role": "LEADER",
"health": "HEALTHY"
}
],
"version": "1.0.0"
},
{
"nodeId": 1,
"host": "zb-zeebe-1.zb-zeebe.default.svc.cluster.local",
"port": 26501,
"partitions": [
{
"partitionId": 1,
"role": "FOLLOWER",
"health": "HEALTHY"
},
{
"partitionId": 2,
"role": "FOLLOWER",
"health": "HEALTHY"
}
],
"version": "1.0.0"
}
],
"clusterSize": 5,
"partitionsCount": 4,
"replicationFactor": 2,
"gatewayVersion": "1.0.0"
}
Now, we scaled the pods down to 2.
We were still able to deploy a workflow, so we assumed the cluster can still accept requests as long as at least 2 brokers are available (as per the quorum logic).
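For reference, the scale-down and the test deployment were done roughly like this; the StatefulSet name zb-zeebe is inferred from the pod host names above, and the BPMN file name is just illustrative:

kubectl scale statefulset zb-zeebe --replicas=2
zbctl deploy order-process.bpmn --insecure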
Below is the cluster status now; from it we can see that partition 2 only has a follower (no leader), and partitions 3 and 4 are not listed at all:
{
"brokers": [
{
"nodeId": 0,
"host": "zb-zeebe-0.zb-zeebe.default.svc.cluster.local",
"port": 26501,
"partitions": [
{
"partitionId": 1,
"role": "LEADER",
"health": "HEALTHY"
}
],
"version": "1.0.0"
},
{
"nodeId": 1,
"host": "zb-zeebe-1.zb-zeebe.default.svc.cluster.local",
"port": 26501,
"partitions": [
{
"partitionId": 1,
"role": "FOLLOWER",
"health": "HEALTHY"
},
{
"partitionId": 2,
"role": "FOLLOWER",
"health": "HEALTHY"
}
],
"version": "1.0.0"
}
],
"clusterSize": 5,
"partitionsCount": 4,
"replicationFactor": 2,
"gatewayVersion": "1.0.0"
}
Our doubts are:
- Why didn't node 1 become the leader for partition 2? As per the Zeebe docs, node 1 should have become the leader for partition 2, cmiiw?
- Is the quorum value computed at the cluster level or per partition?
- In the above scenario, what happens if job workers are in the middle of executing jobs and the broker holding their partition goes down?
- In another scenario we kept the partition count at 2, the replication factor at 2, and the cluster size at 5, but only 3 nodes came up; the readiness probes of the remaining two nodes keep returning 503. What could be the reason for this? (See the probe check sketch after this list.)
- If we want to upgrade the Zeebe cluster in the future, what scenarios should we consider with respect to resilience and fault tolerance?
- We are planning to set up the Zeebe brokers on AWS. Is a Zeebe cluster available as an AWS AMI image, similar to Elasticsearch?
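Regarding the readiness probe 503s, this is roughly how we observe them; the monitoring port 9600 and the /ready endpoint are our assumptions about the default broker configuration, and the pod name is just an example of one of the non-ready pods:

kubectl port-forward zb-zeebe-3 9600:9600
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9600/ready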