Zeebe 0.23.4 in cluster: Long delay for starting jobs with zeebe-node clients

Hi,

Recently, we have moved our deployment from docker-compose to k8s cluster.
Since then, we observe a long delay (could be up to 30 seconds) until a worker starts to process a job.

Cluster configuration is:
$ zbctl --insecure status
Cluster size: 3
Partitions count: 3
Replication factor: 3
Gateway version: 0.23.4
Brokers:
  Broker 0 - zeebe-zeebe-0.zeebe-zeebe.default.svc.cluster.local:26501
    Version: 0.23.4
    Partition 1 : Follower
    Partition 2 : Follower
    Partition 3 : Follower
  Broker 1 - zeebe-zeebe-1.zeebe-zeebe.default.svc.cluster.local:26501
    Version: 0.23.4
    Partition 1 : Follower
    Partition 2 : Leader
    Partition 3 : Leader
  Broker 2 - zeebe-zeebe-2.zeebe-zeebe.default.svc.cluster.local:26501
    Version: 0.23.4
    Partition 1 : Leader
    Partition 2 : Follower
    Partition 3 : Follower

We were using Zeebe 0.23.2 and just moved to 0.23.4, hoping that this issue was the cause: Long Polling is blocked even though jobs are available on the broker · Issue #4396 · camunda-cloud/zeebe · GitHub
Still, the problem persists.

Our clients are based on zeebe-node (same version).

This happens for the first job of a workflow instance and also before each subsequent job starts.
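
For context, our workers are essentially plain zeebe-node workers along these lines (the gateway address and task type are placeholders, and the exact createWorker signature depends on the zeebe-node version):

import { ZBClient } from 'zeebe-node'

// The gateway address is a placeholder for our in-cluster service name
const zbc = new ZBClient('zeebe-zeebe-gateway:26500')

// 'some-task-type' is a placeholder; the handler just completes the job
zbc.createWorker('example-worker', 'some-task-type', (job, complete) => {
  console.log('Activated job', job.key)
  complete.success()
})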

@yoavsal are you using the Helm Charts from helm.zeebe.io? Can you share which version?

Cheers

Hi @salaboy,

We started off with Zeebe 0.23.1 and used the charts that were available at the time.
Also, we may have edited the deployment files both manually and through Helm since then (we should align these).
Our current configuration is:
$ kubectl describe deploy zeebe-zeebe-gateway
https://gist.github.com/yoavsal/7deb98b5180209a40d672cff32deee55

$ kubectl describe statefulset zeebe-zeebe
https://gist.github.com/yoavsal/9cfc75999efd925eb32a7b1289334a94

Hi @yoavsal, What happens when you run the same code against a local broker?

This will help isolate if this is a K8s issue, a Node client issue, or a combination of the two.

Also, can you run against K8s with the env var ZEEBE_NODE_LOGLEVEL=DEBUG? This will give you more insight into what is happening.

@yoavsal
I am interested in your zeebe-zeebe configmap, would you mind sharing it? I cannot get my Zeebe statefulset working with 0.23.4.

Here is the configmap:

$ kubectl describe cm zeebe-zeebe
https://gist.github.com/yoavsal/4ee8ea96b2557d4645ee923858e28bb5

On the docker-compose setup the jobs start instantly.

Will try to get the DEBUG logs from the client as well.

Hi @jwulf ,

The debug log outputs tons of messages.
I captured the logs from the broker and the worker here:
https://gist.github.com/yoavsal/3c46a6fc88da7cb51bfd5f85206f0b4e

The first line of the log comes from the broker (from our custom exporter, actually).
Then, after ~8 seconds, the first debug line from the zeebe-node client shows up.

To find the first log line from our worker, search for “REST2”.

The last line of the log is again from our exporter, when the workflow instance completes.

BTW, the environment variable is ZEEBE_NODE_LOG_LEVEL (not ZEEBE_NODE_LOGLEVEL)
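
For reference, this is how we start the worker with debug logging enabled (the script path is just an example, not our actual file name):

$ ZEEBE_NODE_LOG_LEVEL=DEBUG node worker.js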


Hmmm… ok, so an 8-second delay when using K8s, and no delay on a local broker.

Can you paste the worker logs from before your exporter sees the job?

The other thing to do is to use a zbctl worker to isolate whether it is a broker issue, or an interaction with the client. See the Register a Worker section here: https://docs.cloud.camunda.io/docs/cli-zbctl
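
Something along these lines should do it; the job type is a placeholder for whatever your workflow uses, and "cat" simply echoes the job variables back as the completion payload:

$ zbctl --insecure create worker some-job-type --handler "cat"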

With the zbctl CLI worker I also observed a delay. In some cases it was only 4 seconds, but it could go up to 10 or 20 seconds as well.

Dump of worker logs before job starts can be found here:
https://gist.github.com/yoavsal/bef518a0815c42c9032f8cc956f6c931

The log starts when the worker service is started and then follows the execution of 2 workflow instances.
The instance IDs are: 2251799813723035 and 4503599627408321

The first one shows a delay of ~4 seconds, but the second shows 14 seconds.

Thanks!

OK, great. It’s a broker on K8s issue. Could it be CPU/Memory starvation?

I don’s see any indication for starvation.
We actually increases the requested cpu and memory and have enough resources on the node.
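
A quick way to sanity-check actual usage, assuming metrics-server is installed in the cluster:

$ kubectl top pods
$ kubectl top nodes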

Notice that job processing seems to happen on some time cycle.
E.g. if I create several instances, there is a delay until the first job is processed, but once processing starts, all the instances are handled immediately.
Could it be that long polling is not working as expected in gateway/cluster mode?

It could be. If you set the broker/gateway logs to DEBUG level, you should see the job activation requests there.
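
If I remember the Docker image correctly, the log level is controlled by the ZEEBE_LOG_LEVEL environment variable, so something like this against the resources from your kubectl output should work (double-check the variable name for your image version; note that this triggers a rolling restart):

$ kubectl set env statefulset/zeebe-zeebe ZEEBE_LOG_LEVEL=debug
$ kubectl set env deployment/zeebe-zeebe-gateway ZEEBE_LOG_LEVEL=debug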

@jwulf, @salaboy
I have upgraded my Helm charts from v0.0.110 to v0.0.128 and the delay seems to be gone.
I'm not sure what exactly fixed it, but at least it is gone now.
Thanks for the support

Quick update:
The delays with 0.23.4 were better than before, but after a day of use they are getting worse again.
At first, I suspected the JavaOpts: the options listed below were removed in v128 compared to v100, but I'm not sure…

-XX:+UseParallelGC
-XX:MinHeapFreeRatio=5
-XX:MaxHeapFreeRatio=10
-XX:GCTimeRatio=4
-XX:AdaptiveSizePolicyWeight=90
-XX:+PrintFlagsFinal
-Xmx4g
-Xms4g
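
If we decide to re-apply them, I believe the chart exposes these through a JavaOpts value in values.yaml (the key name is from memory and needs checking against the chart version), roughly:

# key name from memory, check values.yaml of your chart version
JavaOpts: >-
  -XX:+UseParallelGC
  -XX:MinHeapFreeRatio=5
  -XX:MaxHeapFreeRatio=10
  -XX:GCTimeRatio=4
  -XX:AdaptiveSizePolicyWeight=90
  -XX:+PrintFlagsFinal
  -Xmx4g
  -Xms4g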

The reason we removed these lines is based on the findings described here: https://github.com/zeebe-io/zeebe/issues/4664
