we are using Zeebe in production, but we are suffering problems that we cannot reproduce. Randomly one workflow execution gets blocked and the worker taking that jobType is not activated, but after aprox. 300 seconds, it is activated.
Our workers use Java Clients, and we have workers deployed in k8s cluster, but Zeebe is installed outside the cluster, in 3 MVs, one for gateway standalone and Elastic, other with 3 brokers and the last one with other 3 brokers. Our Zeebe cluster configuration is:
- cluster size: 6
- partitions: 4
- replication: 3
We don’t know why is this happening (no cause in logs of gateway or brokers), and we decided last week to move to a new Zeebe Cluster in version 0.26.1, but we are having the same behavior.
We have found this environment variable ZEEBE_BROKER_STEPTIMEOUT, which has 5 minutes as default value, and we don’t know if that is the reason to workers to activate 300 seconds later when the unknown error occurs.
Do you have any ideas about this random behavior?
Any idea is greatly appreciated, as we are receiving doubts about Zeebe.
Thank you in advance.