We are using zeebe 8.5 in production. We are seeing lot of processes getting stuck. Some of them are stuck before executing a service task where we noticed that the workflow is hung even before the job worker code is called.
We are unable to identify a pattern for such issues still. I see a lot of issues reported on github for stuck processes with most of them in open state.
It is hard for us to reproduce such issues because different workflows are getting stuck.
I am looking for expert inputs that could help us identify the root cause and more importantly what are the workarounds that we could apply to make the workflow executions move ahead? Would increasing the cluster size help (our current cluster size for broker and gateway is 1 and the partition count is 5)? Would restarting the whole zeebe cluster, temporarily make the workflows execute? Anything else?
Are you running Self Managed on On Premise or cloud hosted?
Self-managed
You can refer this link where all the details are captured in one place.
Are you hinting towards the fact that the processes are getting stuck because of lack of resources? Increasing the number of brokers and partitions would resolve the issue?