Hi,
I am a bit curious about how Zeebe uses RocksDB, or any DB. I have read some posts saying that it does not use any database, but I have also read in some places that it uses RocksDB to store state in each partition.
Could someone please clarify, or point me to some material to read through?
Also, is there any material on how the fault tolerance works?
Hi @sherry-ummen,
Zeebe does not use a centralized, external or shared database.
@deepthi gave a great talk about scaling workflow engines, Scaling a distributed workflow engine (CamundaCon 2019) - YouTube, which discusses the problems with such databases in the first 2 minutes and then goes on to explain how Zeebe solves scaling and fault tolerance.
You can also read more about our technical concepts here.
Lastly, to more directly answer your question about RocksDB: Zeebe does use a database, and you’re right, it’s RocksDB. Each Zeebe broker is in charge of one or multiple partitions. Per partition it keeps a log of records and a view of the current state. As it processes records on the log, it updates that view (or state) in RocksDB.

If the broker goes down for whatever reason, another broker can take over the work (as long as the replication factor is greater than 1), because the log was replicated. That broker just has to replay the events on the log to produce the same view of the state in its own RocksDB, and can then start processing as leader of that partition. In reality there’s a bit more to it, like creating snapshots to compact the log, replay on followers to speed up the transition to the leader role, etc., but this is the main idea behind the log and the local state.
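To make the idea concrete, here is a minimal sketch of the log-plus-replay pattern. It is purely illustrative and not Zeebe’s actual code; the `Record`, `Partition`, and event names are made up for the example.

```python
# Illustrative sketch of "replicated log + local state view + replay".
# The "state" dict plays the role of the RocksDB view in Zeebe.

from dataclasses import dataclass, field


@dataclass
class Record:
    """One log entry, e.g. an event about a process instance."""
    key: str
    event: str  # e.g. "ELEMENT_ACTIVATED", "ELEMENT_COMPLETED"


@dataclass
class Partition:
    log: list = field(default_factory=list)    # append-only, replicated in Zeebe
    state: dict = field(default_factory=dict)  # local view (RocksDB in Zeebe)

    def append_and_process(self, record: Record) -> None:
        self.log.append(record)  # in Zeebe this entry would be replicated to followers
        self._apply(record)      # update the local view

    def _apply(self, record: Record) -> None:
        self.state[record.key] = record.event

    def replay(self, replicated_log: list) -> None:
        """What a new leader does: rebuild the view from the replicated log."""
        self.log = list(replicated_log)
        self.state = {}
        for record in self.log:
            self._apply(record)


# The leader processes records...
leader = Partition()
leader.append_and_process(Record("instance-1", "ELEMENT_ACTIVATED"))
leader.append_and_process(Record("instance-1", "ELEMENT_COMPLETED"))

# ...the leader fails; a follower replays the replicated log
# and ends up with the exact same view of the state.
new_leader = Partition()
new_leader.replay(leader.log)
assert new_leader.state == leader.state
```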
Hope that makes sense.
Nico
Perfect. That’s a great and concise reply.
A few questions still:
- Btw, how does Zeebe handle it if one of the RocksDB instances gets corrupted?
- Are there any disk specification requirements to make it work efficiently?
Thanks
Glad you appreciate it
how does Zeebe handle it if one of the RocksDB instances gets corrupted?
Corruption is a complex topic. We consider how data in Zeebe can get corrupted in general, rather than just looking at the database that holds the view of the current state, because that view can be rebuilt from the records in the log. Zeebe has multiple safeguards against data corruption. First and foremost, a partition leader can step down when it becomes unhealthy, allowing a follower to take over. We verify that replaying the events always leads to the exact same state by using property-based testing. And there are further safeguards like snapshot checksums, but I’m not the expert on those. I’m probably overlooking some things here.
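As a toy illustration of that property-based idea (not Zeebe’s actual tests), one could check that replaying the same log always yields the same state view, for example with the hypothesis library; the record keys and event names below are just placeholders:

```python
# Toy property-based check of replay determinism.
from hypothesis import given, strategies as st


def replay(log):
    """Rebuild a state view from a log of (key, event) records."""
    state = {}
    for key, event in log:
        state[key] = event
    return state


# Random logs of (instance key, event) pairs.
records = st.lists(st.tuples(
    st.sampled_from(["instance-1", "instance-2", "instance-3"]),
    st.sampled_from(["ACTIVATED", "COMPLETED", "TERMINATED"]),
))


@given(records)
def test_replay_is_deterministic(log):
    # Two fresh "brokers" replaying the same replicated log must agree.
    assert replay(log) == replay(log)
```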
Are there any disk specification requirements to make it work efficiently?
Faster is better, but we don’t provide specifications. I’d rather recommend testing the performance like you would with any system: change one small configuration parameter and see if it makes a difference for your use case. If you’re using spinning disks, it might be worth seeing whether you can upgrade to SSDs. I highly recommend the Grafana dashboard as a starting point to measure the impact on throughput and/or latency.
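If you want a very rough first impression of a disk before running a full load test, a small fsync micro-benchmark like the sketch below can help compare volumes, since a synced append is roughly what a commit costs. This is not an official Zeebe tool; the path and iteration count are just placeholders.

```python
# Rough, hypothetical fsync latency probe for comparing disks.
import os
import time

DATA_DIR = "/path/to/zeebe/data"  # placeholder: point at the broker's data volume
ITERATIONS = 200

path = os.path.join(DATA_DIR, "fsync-probe.tmp")
latencies = []
with open(path, "wb") as f:
    for _ in range(ITERATIONS):
        start = time.perf_counter()
        f.write(b"x" * 4096)   # small append, roughly one log entry
        f.flush()
        os.fsync(f.fileno())   # force it to disk, like a commit
        latencies.append(time.perf_counter() - start)
os.remove(path)

latencies.sort()
print(f"median fsync: {latencies[len(latencies) // 2] * 1000:.2f} ms")
print(f"p99 fsync:    {latencies[int(len(latencies) * 0.99)] * 1000:.2f} ms")
```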
Thanks for another great reply.
Btw, how do we back up and restore the state then?
For example, in scenarios where we want to move to a new cluster: if I recall correctly, Zeebe does not support increasing the number of brokers or partitions once a cluster is installed. So in those cases we need to create a new cluster and also move the old state over to the new one. How do you solve such scenarios?
how do we back up and restore the state
A backup of a paused system is rather simple: just make a copy of the data folder of each broker. However, making a backup of a running system is hard due to the distributed nature of the system. This is something the team is interested in, but nothing has been planned for it yet.
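For the paused-system case, the backup can be as small a script as this sketch, assuming all brokers are stopped; the data and backup paths are placeholders for wherever your brokers actually keep their data:

```python
# Minimal "paused cluster" backup sketch: copy each stopped broker's data folder.
import shutil
from datetime import datetime
from pathlib import Path

BROKER_DATA_DIRS = [  # placeholders: one data folder per broker
    Path("/var/lib/zeebe/broker-0/data"),
    Path("/var/lib/zeebe/broker-1/data"),
    Path("/var/lib/zeebe/broker-2/data"),
]
BACKUP_ROOT = Path("/backups/zeebe") / datetime.now().strftime("%Y%m%d-%H%M%S")

for data_dir in BROKER_DATA_DIRS:
    target = BACKUP_ROOT / data_dir.parent.name  # e.g. .../broker-0
    shutil.copytree(data_dir, target)
    print(f"backed up {data_dir} -> {target}")
```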
we need to create a new cluster and also move the old state over to the new one. How do you solve such scenarios?
Yes, such scenarios are sadly still painful. A feature request exists to support scaling an existing cluster (Dynamic resize of an existing zeebe cluster · Issue #4391 · camunda/zeebe · GitHub), but until then we’ll have to deal with workarounds.
For example:
- run a new (larger) cluster next to the old cluster, until the old cluster has no active process instances anymore
- you can experiment with increasing the number of partitions (without increasing the number of brokers) by stopping the cluster, creating a backup of the volumes, increasing the partition count, and rebooting the brokers. We don’t guarantee that this will work, but it may, so it’s worth a try; just make sure to back up the data first.
- manually increase the number of brokers as described in zeebe capacity expansion, Is there an operation manual on how to migrate data during capacity expansion · Issue #7626 · camunda/zeebe · GitHub
Hope it helps
Hmm interesting, how do people use it in a production environment if backup and restore is so difficult? How does Camunda Cloud handle this?
For you, a backup should generally be easy to do (given that you don’t change the cluster’s configuration, like the number of partitions, number of brokers, etc.). Here’s what you do (a rough sketch of these steps follows the list):
- stop your cluster by stopping each broker
- copy the data folder of each broker to the data folder of the corresponding broker in your new cluster
- start up each broker in the new cluster
- the new cluster now continues from this backup
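Here is a minimal sketch of those steps, assuming both clusters are stopped, have identical configuration (partitions, brokers, replication factor), and their data folders are reachable from one machine; every path below is a placeholder:

```python
# Sketch of moving broker data from an old cluster to a new one, broker by broker.
import shutil
from pathlib import Path

OLD_BROKER_DATA = [  # placeholders: data folders of the stopped old cluster
    Path("/old-cluster/broker-0/data"),
    Path("/old-cluster/broker-1/data"),
    Path("/old-cluster/broker-2/data"),
]
NEW_BROKER_DATA = [  # placeholders: data folders of the (not yet started) new cluster
    Path("/new-cluster/broker-0/data"),
    Path("/new-cluster/broker-1/data"),
    Path("/new-cluster/broker-2/data"),
]

# The topology must match, otherwise this approach does not apply.
assert len(OLD_BROKER_DATA) == len(NEW_BROKER_DATA), "cluster topology must match"

for old_dir, new_dir in zip(OLD_BROKER_DATA, NEW_BROKER_DATA):
    if new_dir.exists():
        shutil.rmtree(new_dir)  # the new broker should start from the copied data only
    shutil.copytree(old_dir, new_dir)
    print(f"restored {old_dir} -> {new_dir}")

# After this, start each broker of the new cluster; it continues from the copied state.
```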
Hope it helps