How to identify the root cause and workaround for stuck workflow executions?

We are using Zeebe 8.5 in production and are seeing a lot of process instances getting stuck. Some of them are stuck before executing a service task; the workflow hangs before the job worker code is even called.

We have not been able to identify a pattern for these issues yet. I see a lot of issues reported on GitHub for stuck processes, most of them still open.

It is hard for us to reproduce these issues because different workflows get stuck each time.

I am looking for expert input that could help us identify the root cause and, more importantly, the workarounds we could apply to make the stuck workflow executions move ahead. Would increasing the cluster size help (we currently run 1 broker and 1 gateway with a partition count of 5)? Would restarting the whole Zeebe cluster temporarily get the workflows executing again? Anything else?

Please share your inputs, as we are blocked.

Thanks.

Have you looked into the Zeebe Chaos blog for Camunda? Are you running Self-Managed on-premises or cloud-hosted?

You can refer to this link, where all the details are captured in one place.

Have you looked into the Zeebe Chaos blog for Camunda?

No, what's that?

Are you running Self-Managed on-premises or cloud-hosted?

Self-managed

You can refer to this link, where all the details are captured in one place.

Are you hinting that the processes are getting stuck because of a lack of resources? Would increasing the number of brokers and partitions resolve the issue?

Thanks

Camunda Chaos: Blog | Zeebe Chaos

Yes. At least 3 brokers and 2 gateways are recommended for running in production.
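
Purely as an illustrative sketch of that sizing (not taken from your setup), each broker's application.yaml in a 3-broker cluster could look roughly like this; the node IDs, hostnames, and partition count below are placeholders to adapt to your deployment:

```yaml
# Hypothetical application.yaml for each of the three brokers
# (node IDs, hostnames, and counts are placeholders).
zeebe:
  broker:
    cluster:
      nodeId: 0                    # 0, 1, 2 on the respective brokers
      clusterSize: 3               # three brokers in total
      partitionsCount: 5           # keep your existing partition count
      replicationFactor: 3         # each partition replicated to all three brokers
      initialContactPoints:
        - zeebe-broker-0:26502
        - zeebe-broker-1:26502
        - zeebe-broker-2:26502
```

With a replication factor of 3, losing or restarting one broker still leaves a quorum for every partition, which a single-broker cluster cannot do.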

If you would like to monitor the cluster's current state while a process is unresponsive, you can check the APIs available on the gateway.
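
As a minimal sketch (assuming the Zeebe Java client and a gateway reachable at localhost:26500 without TLS; adjust both for your setup), you can query the gateway's Topology API to see which brokers and partitions are reported as healthy and which partitions have a leader while instances appear stuck:

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.Topology;

public class ClusterTopologyCheck {
    public static void main(String[] args) {
        // Gateway address and plaintext connection are assumptions for this sketch.
        try (ZeebeClient client = ZeebeClient.newClientBuilder()
                .gatewayAddress("localhost:26500")
                .usePlaintext()
                .build()) {
            Topology topology = client.newTopologyRequest().send().join();
            topology.getBrokers().forEach(broker -> {
                System.out.printf("Broker %d at %s%n", broker.getNodeId(), broker.getAddress());
                broker.getPartitions().forEach(partition ->
                        System.out.printf("  partition %d: role=%s, health=%s%n",
                                partition.getPartitionId(),
                                partition.getRole(),
                                partition.getHealth()));
            });
        }
    }
}
```

If a partition reports no leader or an unhealthy state, instances assigned to that partition cannot make progress, which would match the symptom of a workflow hanging before the job worker is ever called.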