We are using zeebe 8.5 in production. We are seeing lot of processes getting stuck. Some of them are stuck before executing a service task where we noticed that the workflow is hung even before the job worker code is called.
We are unable to identify a pattern for such issues still. I see a lot of issues reported on github for stuck processes with most of them in open state.
It is hard for us to reproduce such issues because different workflows are getting stuck.
I am looking for expert inputs that could help us identify the root cause and more importantly what are the workarounds that we could apply to make the workflow executions move ahead? Would increasing the cluster size help (our current cluster size for broker and gateway is 1 and the partition count is 5)? Would restarting the whole zeebe cluster, temporarily make the workflows execute? Anything else?
Kindly share your inputs as we are blocked.
Thanks.