Suspended processes continue to run, and can't keep track of their progress

Hi,

I’m seeing the following:

  1. When I suspend a process instance in Cockpit, the process keeps running.
  2. I let the process finish to the end state (an optimistic locking exception is seen here).
  3. If I then re-activate the process instance, the process instance starts over from the beginning.
  4. It then finishes normally if I leave it un-suspended.

I have tried many different combinations of putting async continuation points at various parts of the process, but no matter what I do, I can’t get it to keep track of where it was when it was suspended. Basically I would have to put async continuations on every single task in the process so that it would resume at the nearest point…

It seems like the worker job executor doesn’t keep track of its progress when suspended.

It seems like since Camunda knows which job executor is currently working on the process, it should be able to send some kind of signal to the JE, indicating it’s suspended. It’s not really feasible to require putting async continuations on every single task ahead of time, in anticipation that a process instance could be suspended at any time…

What can be done about this? I’m on Camunda version 7.4.

Thanks,
Galen

By the way, it also seems like the behavior I’m seeing is not following what is stated here:

https://docs.camunda.org/manual/7.4/user-guide/process-engine/process-engine-concepts/#suspend-process-instances

That doc says:

" when suspending a process instance, all tasks belonging to it will be suspended. Therefore, it will no longer be possible to invoke actions that have effects on the task’s lifecycle (i.e., user assignment, task delegation, task completion, …). "

However, I’m seeing tasks continue to be invoked, completed, etc…

Hi Galen,

This is intended behavior. The process engine uses optimistic locking to resolve conflicting updates to a logical entity and this is an instance of it. This is not restricted to suspension. For example, you get the same behavior if you delete the process instance in parallel while it is executed, if you update a variable while it is executed (and that variable changes), etc.

It seems like since Camunda knows which job executor is currently working on the process, it should be able to send some kind of signal to the JE, indicating it’s suspended.

The way this can be done is by using pessimistic locking instead of optimistic locking. For example, think of a clustered setup where the job executor runs on a different machine than Cockpit. The only shared state in a Camunda cluster is the database; there is no direct communication between engine nodes. So this synchronization has to happen via the database, and write locks (i.e. pessimistic locking) are the way to do that.

There is no way this will be implemented in Camunda 7, as optimistic locking is a central design decision of the process engine, and it has a lot of benefits, such as making deadlocks very unlikely to occur. I also don’t see something like direct communication between cluster nodes happening, as this would complicate cluster setups very much.
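To make the difference concrete, here is a small, purely illustrative sketch of how revision-based optimistic locking works in principle (the table, column and class names are made up, not the engine’s actual schema or code):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class OptimisticUpdateSketch {

  /**
   * Update a row only if nobody else changed it since we read it.
   * Two concurrent writers both read revision N; the first UPDATE bumps the
   * row to N + 1, the second then matches zero rows and fails -- the same
   * effect as the OptimisticLockingException reported by the job executor.
   */
  public void updateState(Connection conn, String id, int readRevision, String newState)
      throws SQLException {
    String sql = "UPDATE EXAMPLE_EXECUTION SET STATE_ = ?, REV_ = REV_ + 1 "
               + "WHERE ID_ = ? AND REV_ = ?";
    try (PreparedStatement statement = conn.prepareStatement(sql)) {
      statement.setString(1, newState);
      statement.setString(2, id);
      statement.setInt(3, readRevision);
      if (statement.executeUpdate() == 0) {
        throw new IllegalStateException("Row " + id + " was modified concurrently");
      }
    }
  }
}
```

Neither writer blocks the other while working; the conflict only surfaces at commit time, which is why the engine favours this over holding database locks.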

It’s not really feasible to require putting async continuations on every single task ahead of time, in anticipation that a process instance could be suspended at any time…

Why not? You can even do that programmatically via a parse listener, so there is no need to tick check boxes in the Modeler for every task.
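For illustration, here is a minimal sketch of such a parse listener (class names and the plugin wiring are just one way to do it, not a prescribed implementation): it marks every user task and service task as asyncBefore at deployment time, which is equivalent to ticking the check box on each task.

```java
import java.util.ArrayList;
import java.util.List;

import org.camunda.bpm.engine.impl.bpmn.parser.AbstractBpmnParseListener;
import org.camunda.bpm.engine.impl.bpmn.parser.BpmnParseListener;
import org.camunda.bpm.engine.impl.cfg.AbstractProcessEnginePlugin;
import org.camunda.bpm.engine.impl.cfg.ProcessEngineConfigurationImpl;
import org.camunda.bpm.engine.impl.pvm.process.ActivityImpl;
import org.camunda.bpm.engine.impl.pvm.process.ScopeImpl;
import org.camunda.bpm.engine.impl.util.xml.Element;

public class AsyncContinuationParseListener extends AbstractBpmnParseListener {

  @Override
  public void parseUserTask(Element userTaskElement, ScopeImpl scope, ActivityImpl activity) {
    activity.setAsyncBefore(true); // same effect as asyncBefore="true" in the BPMN XML
  }

  @Override
  public void parseServiceTask(Element serviceTaskElement, ScopeImpl scope, ActivityImpl activity) {
    activity.setAsyncBefore(true);
  }

  /** Registers the listener with the engine as a process engine plugin. */
  public static class Plugin extends AbstractProcessEnginePlugin {
    @Override
    public void preInit(ProcessEngineConfigurationImpl configuration) {
      List<BpmnParseListener> listeners = configuration.getCustomPreBPMNParseListeners();
      if (listeners == null) {
        listeners = new ArrayList<>();
        configuration.setCustomPreBPMNParseListeners(listeners);
      }
      listeners.add(new AsyncContinuationParseListener());
    }
  }
}
```

You would extend this with the other parse* callbacks (send tasks, call activities, …) for whatever activity types you want to cover.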

Cheers,
Thorben

Hi Thorben,

Thanks for the answers. This does make sense to me in terms of the system design. The only things I would want to followup on are:

  1. If there is any way to avoid re-running the last task when a process is suspended, then I would be interested in that solution. Right now, given the implementation, it seems impossible not to run the last task twice in the case of suspend, then activate.

  2. I’m interested in the parse listener approach. I’m assuming you would just add an “asyncBefore” in front of every task?

  3. I think the documentation needs to be a bit clearer that suspension only takes effect at the next point the process instance interacts with the database (e.g. at an async continuation point). I think the wording about “token state” is correct, but it wasn’t immediately clear to me. :)

Thanks for the help!
Galen

Hi Galen,

regarding re-runs of tasks, I tend to use the following principles:

  1. Make ‘transactions’ larger, i.e. aim for fewer check points (async continuations).
  2. Use check points at natural boundaries (these also make great suspend points).
  3. If a service is not idempotent, isolate it into a transaction of its own (i.e. async before & after); see the sketch below.
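As an illustration of the third point, here is a small sketch using Camunda’s fluent BPMN model API (the process id, task id and delegate class are made up): the non-idempotent call sits in a transaction of its own because the task is marked both asyncBefore and asyncAfter.

```java
import org.camunda.bpm.model.bpmn.Bpmn;
import org.camunda.bpm.model.bpmn.BpmnModelInstance;

public class IsolatedServiceTaskExample {

  public static BpmnModelInstance build() {
    return Bpmn.createExecutableProcess("paymentProcess")
      .startEvent()
      .serviceTask("chargeCreditCard")
        .camundaClass("org.example.ChargeCreditCardDelegate") // hypothetical delegate
        .camundaAsyncBefore()  // commit and create a job before invoking the service
        .camundaAsyncAfter()   // commit again right after the service returns
      .endEvent()
      .done();
  }
}
```

This keeps the scope of any re-run as small as possible: if the surrounding transaction has to be retried, only this single task is repeated rather than a longer chain of work.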

In a distributed system you always need to deal with failure, so even without the suspend behaviour you may still need to handle re-runs of tasks. Hence my strategy is to minimise both likelihood and consequence…

regards

Rob

Hi Rob,

I’m mostly concerned about the case where a service is not idempotent. Even if I mark it async before and after, that still doesn’t help me; in other words, it will still get run again. I’m wondering why commit points fail to commit (e.g. with an optimistic locking exception) when the process is suspended. Isn’t there some sort of retry strategy that should retry the commits until they work? I thought that was the whole concept of optimistic locking – you can keep trying to commit until you finally succeed… It seems to me that the async-after point should still be able to be locked in, but subsequent tasks should not be able to run. That is what “suspend” intuitively means to me.

I guess what I’m saying is that “suspend” should be a shortcut for “force an async commit (and also stop processing) at the next/soonest possible point”. Does that make sense?

When you say “minimise likelihood” do you mean “minimise suspending instances”? Because as far as I can tell, the re-run always happens on suspend.

Thanks,
Galen

Hi Galen,
I am assuming that suspend is an exceptional circumstance. In addition, from an operational perspective I follow a drain stop: suspend all new instances but let in-flight ones run to their next checkpoint…

Under these circumstances, you shouldn’t get an optimistic locking exception…

Regards
Rob

Hi Rob,

Yes, that’s a good point. I think the drain stop approach will be a good one. I think you basically mean that we should suspend at the process definition level (but not cascade to instances), instead of suspending the instances.
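For reference, here is a rough sketch of what that could look like with the Java API (the process definition key “orderProcess” is made up): the definition is suspended so no new instances can be started, but the second argument is false so running instances are left alone and can drain to completion.

```java
import org.camunda.bpm.engine.ProcessEngine;
import org.camunda.bpm.engine.RepositoryService;

public class DrainStopExample {

  public void drainStop(ProcessEngine engine) {
    RepositoryService repositoryService = engine.getRepositoryService();

    // suspendProcessInstances = false -> do NOT cascade the suspension to running instances
    // suspensionDate = null           -> take effect immediately instead of at a scheduled time
    repositoryService.suspendProcessDefinitionByKey("orderProcess", false, null);

    // later, once maintenance is done:
    // repositoryService.activateProcessDefinitionByKey("orderProcess", false, null);
  }
}
```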

Thanks,
Galen

Hi Galen,

Spot on - and if your process is divided into multiple ‘transactions’ via async continuations, then you can suspend at the granularity of a job definition. Thus you can suspend the upstream jobs and let downstream in-flight process instances complete…
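A rough sketch of that job-definition-level suspension via the Java API (the activity id “approveOrder” is made up): only the jobs belonging to the upstream async continuation are suspended, so instances that are already past that point keep running to completion.

```java
import org.camunda.bpm.engine.ManagementService;
import org.camunda.bpm.engine.ProcessEngine;
import org.camunda.bpm.engine.management.JobDefinition;

public class JobDefinitionSuspensionExample {

  public void suspendUpstream(ProcessEngine engine, String processDefinitionId) {
    ManagementService managementService = engine.getManagementService();

    // find the job definition created by the async continuation on the upstream activity
    JobDefinition upstream = managementService.createJobDefinitionQuery()
      .processDefinitionId(processDefinitionId)
      .activityIdIn("approveOrder") // hypothetical upstream activity id
      .singleResult();

    // includeJobs = true   -> also suspend jobs that already exist for this definition
    // executionDate = null -> take effect immediately
    managementService.suspendJobDefinitionById(upstream.getId(), true, null);
  }
}
```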