Zeebe 0.23.4 in cluster: Long delay for starting jobs with zeebe-node clients

Hi,

Recently, we have moved our deployment from docker-compose to k8s cluster.
Since then, we observe a long delay (could be up to 30 seconds) until a worker starts to process a job.

Cluster configuration is:
$ zbctl --insecure status
Cluster size: 3
Partitions count: 3
Replication factor: 3
Gateway version: 0.23.4
Brokers:
  Broker 0 - zeebe-zeebe-0.zeebe-zeebe.default.svc.cluster.local:26501
    Version: 0.23.4
    Partition 1 : Follower
    Partition 2 : Follower
    Partition 3 : Follower
  Broker 1 - zeebe-zeebe-1.zeebe-zeebe.default.svc.cluster.local:26501
    Version: 0.23.4
    Partition 1 : Follower
    Partition 2 : Leader
    Partition 3 : Leader
  Broker 2 - zeebe-zeebe-2.zeebe-zeebe.default.svc.cluster.local:26501
    Version: 0.23.4
    Partition 1 : Leader
    Partition 2 : Follower
    Partition 3 : Follower

We were using Zeebe 0.23.2 and just moved to 0.23.4, hoping that this issue was the cause: Long Polling is blocked even though jobs are available on the broker · Issue #4396 · camunda-cloud/zeebe · GitHub
Still, the problem persists.

Our clients are based on zeebe-node (same version).

This happens for the first job of a workflow instance and also before each subsequent job starts.
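
For context, our workers are essentially plain zeebe-node workers along these lines (the gateway address and task type are placeholders, and the exact createWorker signature depends on the zeebe-node version):

import { ZBClient } from 'zeebe-node'

// The gateway address is a placeholder for our in-cluster service name
const zbc = new ZBClient('zeebe-zeebe-gateway:26500')

// 'some-task-type' is a placeholder; the handler just completes the job
zbc.createWorker('example-worker', 'some-task-type', (job, complete) => {
  console.log('Activated job', job.key)
  complete.success()
})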

@yoavsal are you using the Helm Charts from helm.zeebe.io? Can you share which version?

Cheers

Hi @salaboy,

We started off with Zeebe 0.23.1 and used the charts that were available at the time.
Also, we may have edited the deployment files both manually and through Helm since then (we should align these).
Our current configuration is:
$ kubectl describe deploy zeebe-zeebe-gateway
https://gist.github.com/yoavsal/7deb98b5180209a40d672cff32deee55

$ kubectl describe statefulset zeebe-zeebe
https://gist.github.com/yoavsal/9cfc75999efd925eb32a7b1289334a94

Hi @yoavsal, What happens when you run the same code against a local broker?

This will help isolate if this is a K8s issue, a Node client issue, or a combination of the two.

Also, can you run against K8s with the env var ZEEBE_NODE_LOGLEVEL=DEBUG? This will give you more insight into what is happening.

@yoavsal
I am interested in your zeebe-zeebe configmap, would you mind sharing it? I cannot get my Zeebe statefulset working with 0.23.4.

Here is the configmap:

$ kubectl describe cm zeebe-zeebe
https://gist.github.com/yoavsal/4ee8ea96b2557d4645ee923858e28bb5

On the docker-compose setup the jobs start instantly.

Will try to get the DEBUG logs from the client as well.

Hi @jwulf ,

The debug log outputs tons of messages.
I captured the logs from the broker and the worker here:
https://gist.github.com/yoavsal/3c46a6fc88da7cb51bfd5f85206f0b4e

The first line of the log comes from the broker (from our custom exporter, actually).
Then, after ~8 seconds, the first debug line from the zeebe-node client shows up.

To find the first log line from our worker, search for “REST2”.

The last line of the log is again from our exporter, when the workflow instance completes.

BTW, the environment variable is ZEEBE_NODE_LOG_LEVEL (not ZEEBE_NODE_LOGLEVEL)
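
For reference, this is how we start the worker with debug logging enabled (the script path is just an example, not our actual file name):

$ ZEEBE_NODE_LOG_LEVEL=DEBUG node worker.js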


Hmmm… ok, so an 8-second delay when using K8s, and no delay on a local broker.

Can you paste the worker logs from before your exporter sees the job?

The other thing to do is to use a zbctl worker to isolate whether it is a broker issue, or an interaction with the client. See the Register a Worker section here: https://docs.cloud.camunda.io/docs/cli-zbctl
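
Something along these lines should do it; the job type is a placeholder for whatever your workflow uses, and "cat" simply echoes the job variables back as the completion payload:

$ zbctl --insecure create worker some-job-type --handler "cat"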

With the zbctl CLI worker I also observed a delay. In some cases it was only 4 seconds, but it could go up to 10 or 20 seconds as well.

Dump of worker logs before job starts can be found here:
https://gist.github.com/yoavsal/bef518a0815c42c9032f8cc956f6c931

The log starts when the worker service is started and then follows the execution of 2 workflow instances.
The instance IDs are: 2251799813723035 and 4503599627408321

The first one shows a delay of ~4 seconds, but the second shows 14 seconds.

Thanks!

OK, great. It’s a broker on K8s issue. Could it be CPU/Memory starvation?

I don’s see any indication for starvation.
We actually increases the requested cpu and memory and have enough resources on the node.
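
A quick way to sanity-check actual usage, assuming metrics-server is installed in the cluster:

$ kubectl top pods
$ kubectl top nodes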

Notice that job processing seems to happen on some time cycle.
E.g. if I create several instances, there is a delay until the first job is processed, but once processing starts, all the instances are handled immediately.
Could it be that long polling is not working as expected in gateway/cluster mode?

It could be. If you set the broker/gateway logs to DEBUG level, you should see the job activation requests there.
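
If I remember the Docker image correctly, the log level is controlled by the ZEEBE_LOG_LEVEL environment variable, so something like this against the resources from your kubectl output should work (double-check the variable name for your image version; note that this triggers a rolling restart):

$ kubectl set env statefulset/zeebe-zeebe ZEEBE_LOG_LEVEL=debug
$ kubectl set env deployment/zeebe-zeebe-gateway ZEEBE_LOG_LEVEL=debug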

@jwulf, @salaboy
I have upgraded my Helm charts from v0.0.110 to v0.0.128 and the delay seems to be gone.
I'm not sure what exactly fixed it, but at least it is gone now.
Thanks for the support

Quick update:
The delays with 0.23.4 were better than before, but after a day of use they are getting worse again.
At first, I suspected the JavaOpts: the options listed below were removed in v128 compared to v100, but I'm not sure…

-XX:+UseParallelGC
-XX:MinHeapFreeRatio=5
-XX:MaxHeapFreeRatio=10
-XX:GCTimeRatio=4
-XX:AdaptiveSizePolicyWeight=90
-XX:+PrintFlagsFinal
-Xmx4g
-Xms4g
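
If we decide to re-apply them, I believe the chart exposes these through a JavaOpts value in values.yaml (the key name is from memory and needs checking against the chart version), roughly:

# key name from memory, check values.yaml of your chart version
JavaOpts: >-
  -XX:+UseParallelGC
  -XX:MinHeapFreeRatio=5
  -XX:MaxHeapFreeRatio=10
  -XX:GCTimeRatio=4
  -XX:AdaptiveSizePolicyWeight=90
  -XX:+PrintFlagsFinal
  -Xmx4g
  -Xms4g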

The reason we removed these lines is based on the findings described here: https://github.com/zeebe-io/zeebe/issues/4664
