Hey @berndruecker,
Is Raft Consensus used within one partition or throughout the whole Zeebe cluster? I guess the latter?
the Raft consensus algorithm is used for a single partition. Every partition is an own Raft group and there is no interaction between different Rafts, even if they are part of the same topic. A raft always starts as a single node, the node which creates the partition itself. And then grows to a specific size by inviting other Zeebe cluster nodes. This size will later be known as the replication factor.
The creation of a topic will look something like this:
- user requests to create topic
foo
with 4 partitions and a replication factor of 3
- the broker who is at the moment leader for the
internal-system
topic partition 0 will then orchestrate the creation of the 4 partitions by choosing 4 brokers to locally create a partition
- note: this does not require more then a single broker in the cluster as a broker can host multiple partitions
- for every partition the chosen broker will create the partition locally and then start a new raft group for this partition
- the raft group leader broker then tries to find enough other brokers to fulfill the replication factor requirement of 3, which means it need 2 followers so the raft group size is 3
- note: this actually requires a cluster size of at least the replication factor as a single broker cannot be in the same raft group multiple times
So in summary the size of a Zeebe cluster is unrelated to a raft group size, except that it has to be at least the highest replication factor.
That means that the whole Zeebe cluster only works if > 50% of the nodes are available, correct?
The Zeebe cluster is basically always available. But the access to a specific topic partition is only available if the quorum of itâs raft members is available. Assume you have five nodes A, B, C, D, E. And a raft group for partition 1 with the members A, B and C. If D and E become unavailable partition 1 is still accessible. If also a single node of A, B or C is unavailable partition 1 is still fine. But if at least two nodes of A, B or C go down partition 1 will not be available anymore.
So I still donât understand how the Zeebe cluster now how many nodes are there
The information of how many nodes exist in a Zeebe cluster is distributed over the Gossip protocol. This is not done by Raft. The cluster orchestration uses the information shared over Gossip to find members for a Raft group.
So how is determined which number of nodes are needed for a quorum?
The quorum is always determined in a single raft group (partition) based on the current number of members of this raft group. This is not related to the cluster size.
âIn order to ensure safety, configuration changes must use a two-phase approach.â
These configuration operations are adding and removing members to a raft group. Not the Zeebe cluster. And are implemented in a way that only a single node can join or leave a raft group at the same time. This again is not related to brokers joining the Zeebe cluster. And also this is not related to scaling Zeebe. The only reason to add nodes to a raft group is to reach the replication factor. Or leave and add to replace a node. This are operations which will only happen at the start of a raft group or during maintenance. These operations are not necessary to scale Zeebe. To scale Zeebe you would add partitions to a topic.
The interesting part about auto scaling is how to add, spread and remove partitions of a topic in a Zeebe cluster. We donât have an answer yet for these questions. But nothing of that will change the size of a Raft group in the end, except that for moving partitions from one node to another it would require that a node is exchanged in the Raft group.
Hope that clarifies again a bit whatâs going on.
Cheers,
Sebastian