Zeebe startup issue

Quite often, after an upgrade of Zeebe, it is not performing well. It throws warnings like this in the gateway PODs and the performance is very bad. Retries mostly make all process instances complete eventually, but not at the usual rate (not even an acceptable one).

2022-03-29 08:56:36.810 [ActivateJobsHandler] [gateway-scheduler-zb-actors-1] WARN
io.camunda.zeebe.gateway - Failed to activate jobs for type core-end-event-v1 from partition 4
java.util.concurrent.TimeoutException: Request ProtocolRequest{id=784, subject=command-api-4, sender=172.16.14.31:26502, payload=byte[]{length=162, hash=-985044966}} to brix-zeebe-3.brix-zeebe.brix-infrastructure.svc:26501 timed out in PT15S
at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$sendAndReceive$4(NettyMessagingService.java:226) ~[zeebe-atomix-cluster-1.3.6.jar:1.3.6]
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.72.Final.jar:4.1.72.Final]
at java.lang.Thread.run(Unknown Source) ~[?:?]

Sometimes waiting a bit longer before creating process instances helps, sometimes it helps to restart the rollout of the statefulset (recreating the PODs one by one), but sometimes nothing helps and I have to delete the PVCs as well and start over with an empty Zeebe and redeploy my processes.
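
For reference, the workarounds above boil down to roughly the following commands (a sketch only: the statefulset name brix-zeebe and the namespace brix-infrastructure are taken from the log above, and the PVC label selector is just an assumption about how the chart labels its volumes):

# Recreate the broker PODs one by one by restarting the statefulset rollout
kubectl -n brix-infrastructure rollout restart statefulset brix-zeebe
kubectl -n brix-infrastructure rollout status statefulset brix-zeebe

# Last resort: remove the brokers including their PVCs and start with an empty Zeebe
kubectl -n brix-infrastructure delete statefulset brix-zeebe
kubectl -n brix-infrastructure delete pvc -l app=brix-zeebe   # label selector is an assumption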

We’re still in testing mode but facing this in production would be a nightmare.

Hey @fvogl

sorry to hear that you are having a bad experience with Zeebe. Could you please share how you set up and configure Zeebe? What does your normal workload look like? Do you have any metrics you can share that show the observed behavior?

Greets
Chris

Hi @Zelldon

here you’ll find our Zeebe setup and also the logs of such a situation from just this morning.

Thanks, Friedrich

Hey @fvogl

I had a quick look at your config and I would like to clarify some things first.

You’re using a replication factor of 2, which is a bit sub-optimal. Is there a specific reason for it? It is wise to use an odd number of replicas, since Zeebe uses Raft. See here for an explanation: distributed - Why is it recommended to create clusters with odd number of nodes - Stack Overflow

Quite often when doing an upgrade of Zeebe afterwards Zeebe is not performing well. It throws warnings in the gateway PODs like this and the performance is very bad.

How long does it take until it recovers? This might be related to the replication factor of two: with two replicas the quorum is also two, so if one node goes down the partition will not be available until the replica is back.
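
If you decide to move to an odd replication factor, a minimal sketch of the change could look like this (the release name brix-zeebe and the zeebe.* value keys are assumptions based on the Camunda Platform Helm chart, so please check them against the chart you actually deployed with; as far as I know the cluster topology cannot be changed in place, so the cluster usually has to be recreated for this to take effect):

# Sketch: run 3 brokers with replication factor 3 via Helm values
helm upgrade brix-zeebe camunda/camunda-platform \
  --reuse-values \
  --set zeebe.clusterSize=3 \
  --set zeebe.replicationFactor=3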

Greets
Chris

Hi @Zelldon

I just wanted to have minimum redundancy while not wasting too many resources in case one broker dies or gets killed during node-by-node maintenance of our OpenShift cluster. But I’ll give it a try with the replication factor set to 3 now.
Normally it either works from the beginning or immediately after a redeploy, but I’ve also seen it recover by just waiting long enough (5+ minutes).

Thanks, Friedrich

Hi @Zelldon

I’ve changed our config to use 3 replicas and initially thought it fixed the problem, but today I had it again :frowning: . I’ll try to reproduce it with a higher log level and send the logs again once I’ve captured them.
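
In case it helps, I plan to raise the log level roughly like this (a sketch; ZEEBE_LOG_LEVEL is the environment variable Zeebe reads for its log level, while the statefulset and gateway deployment names are just placeholders for our setup):

# Raise the log level of the brokers and the standalone gateway to DEBUG
kubectl -n brix-infrastructure set env statefulset/brix-zeebe ZEEBE_LOG_LEVEL=DEBUG
kubectl -n brix-infrastructure set env deployment/brix-zeebe-gateway ZEEBE_LOG_LEVEL=DEBUG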

Thanks, Friedrich

Hi @fvogl,

In Q3 we’re planning to work on official support for OpenShift Helm charts for Camunda 8. I wanted to validate the idea with you:

Proposition: for each new minor release of Camunda 8.x self-managed, support all minor versions of the current OpenShift major version (4.x) that are not end of life (Full Support and Maintenance Support status).
Example: this would mean supporting versions 4.6-4.10 if we released a new minor version on July 1st, 2022. New minor versions of OpenShift are officially supported for 18 months in total per release (OpenShift Container Platform Life Cycle - Red Hat Customer Portal).

Would that make sense for you, taking into consideration your OpenShift policies?

Thanks,
Aleksander

Hi @aleksander-dytko,

of course, that makes perfect sense.

Thanks, Friedrich
