Resizing Zeebe

Dan Shapir: How can I restore from a backup when using Helm?

Ole Schönburg: Hey @Dan Shapir,
unfortunately this is not automated or directly supported by our Helm charts. There’s also no specific guide on how to restore in an environment installed by Helm, only the generic “Backup and restore” guides in the Camunda Platform 8 docs, which I believe you’ve already found.

Are there any specific questions I could help you with?

Dan Shapir: I am trying to resize our cluster, and I saw that if you back up and then restore to a larger cluster (same partition size) it will work.
But I can’t find a way to do it in an environment installed by Helm.

To be honest, being able to back up and restore is essential anyway and needed…

Ole Schönburg: I understand, it’s not great that this is a manual process right now and not properly documented.

I see you already opened an issue in the Helm chart repository (camunda/camunda-platform-helm on GitHub), thank you for that!
I’ve internally forwarded your interest in this feature to product management as well.
If you have access to the Camunda Jira you could also think about creating a feature request there.

Dan Shapir: Ohh I’m fine with it being manual. The problem is that it’s not even doable.

Ole Schönburg: How so? Is there anything specific that doesn’t work?

Dan Shapir: Well, there is no way to run the restore script before the brokers…
I understood there might be some initContainer hack do-able. Not sure if it works and how.

Ole Schönburg: > I understood there might be some initContainer hack do-able. Not sure if it works and how.
Yeah, that’s what I meant with it being a manual process.

Dan Shapir: Ohh ok. So that’s OK (although problematic, as it will run each time the brokers start up, which is bad…), and there is not even a minimal example.
I’d say that being manual might be fine (it’s not something you do every day), but being 100% guesswork, less so.

Ole Schönburg: Yeah, I understand. I’d not recommend using an always-on initContainer for this because the restore itself needs to be orchestrated and checked (did all brokers restore manually, …). We tested this with init containers and it works, but we also removed the init containers after the restore was finished.

For disaster recovery a generic process could look like this:

  1. Stop all brokers (for example by scaling the statefulset down to 0)
  2. Run the restore app on each broker PVC with the correct config available (for example by manually spawning pods that mount the same config and PVC as the brokers)
  3. Check that all restores finished successfully
  4. Start all brokers again
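
Step 2 above could be sketched roughly as follows. This is only a minimal sketch under several assumptions that are not confirmed in this thread: that the StatefulSet is named `<release>-zeebe`, that its volume claim template is called `data` (so the PVCs are `data-<release>-zeebe-<n>`), that broker data lives under `/usr/local/zeebe/data`, and that the restore app is started as `/usr/local/zeebe/bin/restore --backupId=<id>`. The release name, image tag and backup ID are placeholders, and `extra_env` is a hypothetical parameter for passing along the rest of the brokers' configuration (backup store, cluster size, and so on). Check all of these against your own installation and the backup and restore guide before using it.

```python
import json


def restore_pod_manifest(release, node_id, backup_id, image, extra_env=()):
    """Build a one-off Pod that mounts one broker's PVC and runs the restore app."""
    statefulset = f"{release}-zeebe"          # assumption: StatefulSet name
    claim = f"data-{statefulset}-{node_id}"   # assumption: volumeClaimTemplate "data"
    # Each restore pod must act as exactly one broker, so pin the node ID.
    env = [{"name": "ZEEBE_BROKER_CLUSTER_NODEID", "value": str(node_id)}]
    env += list(extra_env)  # the brokers' remaining config (backup store, ...)
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"zeebe-restore-{node_id}"},
        "spec": {
            "restartPolicy": "Never",  # run the restore once, do not loop
            "containers": [{
                "name": "restore",
                "image": image,
                # assumption: path and flag of the restore app in the image
                "command": ["/usr/local/zeebe/bin/restore",
                            f"--backupId={backup_id}"],
                "env": env,
                "volumeMounts": [{"name": "data",
                                  "mountPath": "/usr/local/zeebe/data"}],
            }],
            "volumes": [{"name": "data",
                         "persistentVolumeClaim": {"claimName": claim}}],
        },
    }


def restore_manifests(release, cluster_size, backup_id, image, extra_env=()):
    """One restore Pod per broker, node IDs 0 .. cluster_size - 1."""
    return [restore_pod_manifest(release, n, backup_id, image, extra_env)
            for n in range(cluster_size)]


if __name__ == "__main__":
    # Placeholder values: adjust release, size, backup ID and image tag.
    pods = restore_manifests("camunda", 3, 1234, "camunda/zeebe:8.2.0")
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": pods},
                     indent=2))
```

With the brokers stopped first (for example `kubectl scale statefulset camunda-zeebe --replicas=0`), the printed JSON could be piped into `kubectl apply -f -`. Once every restore pod has completed successfully, delete the pods and scale the StatefulSet back up. Generating the pods outside the chart keeps the restore a deliberate one-off action, so nothing runs again when a broker restarts.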

Dan Shapir: @Ole Schönburg interesting, so which approach would you recommend: an initContainer that you later remove, or attaching to each PVC and running the broker restore against it?

Ole Schönburg: The latter because it gives you more control. The init container approach is only useful when building on top of it to automate some aspects of the restore process.

Dan Shapir: Do you have any tips on how to run it with the same configs easily? Create a Helm chart from scratch that installs the Zeebe image, attaches to each PVC and then runs it? That seems very manual as opposed to the initContainer that will do it for you.
I’m uncertain as to how to make sure the configs are the same when ignoring the original Helm chart. Meaning you suggest running it alongside, basically having a restore chart that shares part of the values but only deploys the Zeebe image and runs the restore command against it.
If so, I think it could be done in the same chart, just by turning on a flag “restore: true” that, if turned on, will run with the same configs but only run the restore command. But that’s only possible if you edit the original Helm charts (the main one and the Zeebe one) to support the flag.

Dan Shapir: @Ole Schönburg interesting thought: what if I edit the configmap in the cluster to execute the restore command, restart all brokers, and once done restore the original configmap and restart them again? Would that work?
If so, that’s a relatively easy solution that also makes sure the brokers can’t load and then get overwritten by another restore on restart.

Ole Schönburg: > Would that work?
Maybe, you’d have to try it out yourself. Keep in mind that you can also restore into a separate, new Zeebe cluster to test the restore process first without taking down production.

Note: This post was generated by Slack Archivist from a conversation in the Camunda Platform 8 Slack, a source of valuable discussions on Camunda 8 (join here). Someone in the Slack thought this was worth sharing!
