Zeebe RocksDB corruption in deployment

goose · October 5, 2023, 11:57am

Hi all,

We’ve recently deployed self-managed Zeebe on a Kubernetes cluster, using a PVC. It was all working fine until one day we found the pod had restarted. It restarts all the time and that’s normally no problem, but this time Zeebe no longer starts up and fails with the error message “Caused by: org.rocksdb.RocksDBException: Bad table magic number: expected NUM, found NUM in DB file name”.

We’re currently in development, so it’s not so much of an issue to wipe the RocksDB partition, but if this happened in live service, we’d be in a sticky wicket.

Has anyone seen this before, and is there a recovery strategy we can use?