Multi region fail-over

andyh · January 24, 2020, 9:14pm

Hi
I’m currently in the process of deploying Zeebe to a AKS Kubernetes cluster and I’m wondering if you have any documentation or recommendations on how Zeebe supports MRF?

One option I can think of is to use Azure Managed Disks where a backup of the disks, used by Zeebe Brokers, are backed-up to a different region. In this scenario if the region went down we would have to restore the cluster in the stand-by region and restore the disks from back-up.

Is there anything I should be aware of that, perhaps isn’t immediately obvious, might make this option a non-starter?

Thanks
Andy

jwulf · January 24, 2020, 9:45pm

Hi @andyh, Zeebe doesn’t support Multi-Region Failover at this point, and that is not on the near-term roadmap.

As you intuit, any MRF set-up at this point would need to be jury-rigged. You could try the Managed Disks approach.

It really depends on how long your workflows run, and how often you back up the disk.

If your workflows are short-lived, then you may be better off storing the initial variables in a replicated database (deleting the record on workflow completion), and on failure, re-running them on a new cluster, handling idempotency in your workers. This gives you deterministic inconsistency and the opportunity of eventual consistency.

If your workflow instances are longer-running, then you probably want them to “start” where they were, on the new cluster - which - at this stage - would require replicating the event log to the new cluster.

But if you start them from some point in the past (a backup from a point in time before the cluster failed), then they start mid-workflow, but not necessarily where they were. You have indeterminate inconsistency. I’m not sure how you would deal with that…

jwulf · January 25, 2020, 12:48pm

Maybe GlusterFS on Kubernetes? FOSDEM 2017 - AMENDMENT Kubernetes+GlusterFS

Maybe not:

GlusterFS is latency dependent. Since self-heal checks are done when establishing the FD and the client connects to all the servers in the volume simultaneously, high latency (mult-zone) replication is not normally advisable.

It looks like Azure can only do scheduled backups: Application-consistent backups of Linux VMs - Azure Backup | Microsoft Learn

Google’s regional persistent disks provide a/synchronous replication and automatic fail-over: Build HA services using regional Persistent Disk | Compute Engine Documentation | Google Cloud. This could allow you to fail-over to another zone in the same region while maintaining the latest state of the cluster.

AWS EBS doesn’t look to have synchronous replication, it is check-point backups like Azure.

Update: I talked with Sagy Volkov (ScaleIO / Red Hat) about this today at DevConf.cz. He has been running latency/throughput tests at Red Hat for storage on Kubernetes. He says that you can do multi-zone async replication with Rook + Ceph. It has good throughput, but double the latency - and the latency is variable, depending on what is happening in the Cloud provider’s infrastructure and where your workload is actually located. A test with the same configuration, deployed in the same zones two hours apart can have different performance. He put it: “I don’t say that performance is bad, but that it is crazy. You can’t predict it.”

He also said that performance testing should have not only the average throughput / latency, but also focus on the 95th and 99th percentile - especially if your application has an SLA.

Further update: I’m in Jose Morales’ (formerly OpenShift, now VMWare) talk on K8s application development. He pointed me to Portworx as a solution that can do Zero-downtime DR:

Zero RPO DR
PX-DR offers RPO-zero failover across data centers in a metropolitan area in addition to HA within a single data center. Examples include, but are not limited to, Azure US East to AWS US East; Azure Germany Central to AWS Europe Frankfurt; Google Cloud asia-east2 to Azure East Asia; any AWS data center to Direct Connected Colo facility.

Continuous backup across the globe
For DR needs that span a country or globe, PX-DR offers continuous incremental-backups so that you can keep an up-to-date backup of your mission critical apps staged in case disaster strikes.

From this case study:

We are huge fans of open source software and before we found Portworx, we tested almost every free and open source product for running stateful containers and they couldn’t satisfy our high requirements in scalability, resilience, and security. We looked at Rook as a self-hosted Ceph cluster within Kubernetes, GlusterFS, OpenEBS, and Rancher Longhorn.

Open source version px-dev.

If I were going to try it, I would probably start by evaluating / testing this and the Google Regional Persistent disks (if Google Cloud were an option).

You have the same trade-off with performance / consistency that you have in any system. If you force synchronous commits to ensure consistency, you will pay a latency cost. If you asynchronously commit for performance, you have a window of inconsistency.

salaboy · January 27, 2020, 9:42am

@andyh are you by any chance using the Helm Charts? I am interested to make sure that they work in AKS, but I haven’t had the chance to try them myself.

Cheers

andyh · January 27, 2020, 9:45am

@jwulf thank you for such a thoroughly well constructed response.

Our workflows will need to be considered long running as most will involve human interaction during the lifetime of a workflow instance, which could mean that a workflow runs for longer than 24hrs.

If I have understood correctly, with Azure Managed Disks (as well as AWS) the backups are snapshots so there is always a window of inconsistency which could result in a workflow starting at an earlier (than expected) point and therefore repeating work that has already been completed, because of this we would have to ensure that any downstream service was idempotent?

If this was the case would the system not eventually become consistent or have I misunderstood something? I.e. Providing a snapshot of the event log (and running state e.g. RocksDB) was taken and all downstream services were idempotent the system would eventually become consistent.

I wouldn’t necessarily say that Google Cloud is not an option for us, but I think it would be fair to say that any solution (whatever that may be) would ideally be in Azure.

As mentioned our workflows are likely to be long running, due to human interaction, so performance in terms of latency might not be a big problem so long as the system was eventually consistent.

I will take a look into the tech you have provided links to and run what you have said by the team. Any decisions we make I’ll happily report back.

Thanks again.
Andy

andyh · January 27, 2020, 9:47am

Hi @salaboy
Yes I have used the Helm Charts and deployed used them to deploy Zeebe (and Operate) to AKS.

Ive also tried using them to deploy to Docker for Windows Kubernetes, which also worked but I have had difficulties getting the pods to run successfully.

salaboy · January 27, 2020, 10:11am

@andyh usually my recommendation if you have access to a cloud provider, is to not waste time working in local clusters. Usually, the differences are big in scheduling and the fact that most local clusters uses a single node topology makes it a very limited environment to work on.
In general if you really want to make it work locally, I would recommend Kubernetes KIND, as if you are going to work in a real life project you might want to use KIND to do integration tests of your Kubernetes Deployments inside of your CI/CD pipelines. Having said this, KIND will also have the same problems with PVCs and storage.

HTH