Is there a possibility of data loss in case of a Zeebe cluster failure?

As I understand it, if a broker that holds leadership of a partition in the Zeebe cluster fails, the Raft protocol ensures that another broker takes over leadership. Thanks to replication, the new leader will have all the latest commands/events.

Now consider a scenario where the whole Zeebe cluster (all broker nodes) goes down. When the cluster is live again, it will recreate the RocksDB instance from the RocksDB snapshots stored in the persistent volume (PV). If the snapshot interval is 5 minutes and commands were executed after the last snapshot, will those events be lost on cluster failure?

I asked @deepthi - one of the Zeebe developers - about this and she said:

Zeebe’s state is stored in snapshot + Raft logs. All commands are written to the Raft log and committed before they are executed. After a restart, Zeebe starts with the previous snapshot and the Raft log with the committed commands/events. Zeebe can re-apply all events (in a deterministic manner) to rebuild the state as it was before the restart. So there is no possibility of data loss.
Data loss can only happen if the data is explicitly deleted from the disk of any broker.
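
To make the snapshot + log idea concrete, here is a toy sketch in Python. It is not Zeebe code and does not use any Zeebe API; `ToyPartition`, `Snapshot`, and all method names are made up for illustration, and a plain list stands in for the durable, committed Raft log. It only shows why records written after the last snapshot are not lost: they are still in the log and get replayed on recovery.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    # Hypothetical snapshot: a copy of the state plus the log position it covers.
    last_applied_index: int
    state: dict


@dataclass
class ToyPartition:
    # Stand-in for the durable, committed log (survives a crash).
    log: list = field(default_factory=list)
    # Stand-in for the last snapshot on disk (survives a crash).
    snapshot: Snapshot = field(default_factory=lambda: Snapshot(0, {}))
    # In-memory state (lost on a crash).
    state: dict = field(default_factory=dict)

    def submit(self, key, value):
        # Write the record to the log first, then apply it to the state.
        self.log.append((key, value))
        self.state[key] = value

    def take_snapshot(self):
        # Persist a copy of the state and remember how much of the log it covers.
        self.snapshot = Snapshot(len(self.log), dict(self.state))

    def crash(self):
        # Simulate losing everything that was only held in memory.
        self.state = {}

    def recover(self):
        # Start from the snapshot, then deterministically replay every
        # log record written after the snapshot was taken.
        self.state = dict(self.snapshot.state)
        for key, value in self.log[self.snapshot.last_applied_index:]:
            self.state[key] = value


if __name__ == "__main__":
    partition = ToyPartition()
    partition.submit("order-1", "created")
    partition.take_snapshot()
    partition.submit("order-2", "created")      # after the snapshot
    partition.submit("order-1", "completed")    # after the snapshot
    partition.crash()
    partition.recover()
    print(partition.state)
```

After `recover()`, the state again contains both `order-2` and the updated `order-1`, even though neither change was in the snapshot. The sketch also shows the caveat above: if the list behind `log` were deleted, the changes made after the snapshot could not be rebuilt.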

Hope that helps.
