Tell Camunda to pick up a workflow where it left off after a system crash?

Hello,

I’m fairly new to Camunda and was wondering: does Camunda support any sort of post-instance-crash workflow recovery? I wasn’t able to find anything on it after looking at the documentation for hours.

Say that my Camunda flow has 5 Java service task steps: A, B, C, D, and E. All of these steps are connected in Camunda modeler linearly, and all of these steps must complete in order for the final state of the database to be valid. In other words, either all of them complete, or none of them should complete. Suppose that I run the workflow, and it runs up until finishing C, but before D can start, the instance running Camunda crashes. Currently, if I mimic this and restart the Camunda instance, Camunda acts like nothing is wrong, and the database has everything up to C populated, but the data from D onwards is missing. Again, this state is now invalid, since D and E must be present for A-C to be valid.

Is there some Camunda configuration, or perhaps a Camunda plugin needed to allow Camunda to pick back up where it left off when the Camunda instance is restarted? Thanks.

There is default functionality that helps with this: asynchronous continuations. By default, Camunda will execute as much as possible in a single transaction: it runs until the next wait state, like a user task or timer event. If you have 5 service tasks in a row, they will all be executed until the final one has completed successfully, and only then will any state be stored.

That’s if you don’t configure the model explicitly. Let’s say that you want to execute A, B and C as a single transactional unit, and D and E together. You should configure task D as asynchronous before in that case (or C as asynchronous after, for that matter). This will make the transaction boundary for the first execution up to and including task C. A job will be created to continue with task D, and, without any other asynchronous markers, include E as well, after which the second transaction will commit.
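For illustration, the async before marker on D could look something like this in the BPMN XML (the delegate class name here is just a placeholder):

```xml
<serviceTask id="taskD" name="D"
             camunda:asyncBefore="true"
             camunda:class="com.example.TaskDDelegate" />
```

In the Modeler, this is the "Asynchronous Before" checkbox on task D; ticking "Asynchronous After" on task C creates the same transaction boundary.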

So, long story short, you can control these boundaries in the process model to match exactly the behaviour that you need.

Restart capabilities are also built in: once the state is stored in the database and there is still work to be done, it will be resumed on restart by the JobExecutor, which takes care of all of the background work.

Note that if you have 5 tasks that are all-or-nothing, you should also have an asynchronous before on the very first one. In most cases that is what you want, even if they are the only tasks in your process. Without that marker, if there is a crash during the execution, you will find NO work left to resume because even the creation of the process instance was not committed before all of the tasks had been executed. With the marker in place, the state is first committed before any of the tasks are attempted.
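For example, marking the very first task could look like this in the BPMN XML (again, the delegate class is a placeholder):

```xml
<serviceTask id="taskA" name="A"
             camunda:asyncBefore="true"
             camunda:class="com.example.TaskADelegate" />
```

With this marker, the engine first commits the freshly created process instance together with a job for A, so a crash at any point during execution leaves a job behind to resume from.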


Thanks for the response, very informative. One last thing: you mentioned that if I mark A with async before, Camunda will keep track of the steps it has completed and will resume them when the instance restarts. Does this mean it will literally restart at the last step that did not finish, or will it retry the workflow from the async before point onwards? Thanks.

Sharing a Camunda blog link, which might be useful in addition to @tiesebarrell’s solution.


It means it will resume from the last known correct state that was committed to the database. Any task that was “in transit” as part of an open transaction will not have been completed correctly and will be executed again. This is because without a transaction boundary that is either required (wait states) or configured (async markers), there is no intermediate storage, so there is also no state to continue from.


@tiesebarrell We are seeing task instances stuck in the ACTIVE state.

How to Reproduce

  • We have one pod for the Camunda Java Spring Boot app.
  • Triggered one process instance.
  • While it was executing, restarted the pod.
  • The pod is back up, but it does not pick up the execution; the instance is stuck in the ACTIVE state.
  • Tried async-before and async-after on all service tasks.
  • Also tried async-after only.
  • Using an external Postgres DB.

Update
After exactly 30 minutes, the tasks were picked up again.
We have a 30-minute lock configured: camunda.bpm.job-execution.lock-time-in-millis: 1800000

This is expected behavior of the job executor: when the pod went down, the committed task state was ACTIVE and the jobs were still locked by the old pod, so the new pod had no claim on them and they appeared stuck. After 30 minutes the locks expired and the job executor picked the jobs up again, as per the lock-time-in-millis configuration.

But how can we prevent this? What configuration is required in the BPMN diagram so that the new pod can pick up the execution immediately instead of waiting for 30 minutes?

@tiesebarrell @StephanHaarmann please help here.

Hi @Vivek_Korat,

Firstly, I’d appreciate it if you created a new thread when you have a question that differs from the original one, and linked this thread if you consider it relevant. Older threads often slip our attention.

Second, you’re right. What you describe is intended behavior. The easiest way to resolve it is by reducing camunda.bpm.job-execution.lock-time-in-millis. Besides that, I cannot think of any configuration that would mitigate this issue.
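For example, in application.yml a shorter lock time could look like this (the 5-minute value is just an illustration; it must still be longer than your longest-running job):

```yaml
camunda:
  bpm:
    job-execution:
      lock-time-in-millis: 300000  # 5 minutes
```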
Certainly, you could write your own extension that updates the state of all jobs upon engine startup, but I don’t consider that clean.
Do you really need a lock-time of 30 min?

Can you please share such an example?

Yes, we need a lock time of 30 minutes, as some tasks may run for more than 15 minutes, and we need to avoid duplicate executions of a task while it is still running.

Well, before going down this route, have you thought about using external task workers?
By using external task workers, you can decouple the implementation of service tasks from the process engine. In this case, a crash of Camunda is very unlikely. Furthermore, you can determine the lock duration per task, which gives you more flexibility.
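As a sketch, a long-running step could be handled by an external task worker like this (the topic name, base URL, and the work itself are placeholders; this assumes the camunda-external-task-client dependency is on the classpath):

```java
import org.camunda.bpm.client.ExternalTaskClient;

public class LongRunningWorker {
    public static void main(String[] args) {
        ExternalTaskClient client = ExternalTaskClient.create()
                .baseUrl("http://localhost:8080/engine-rest") // engine REST endpoint
                .asyncResponseTimeout(10000)                  // long polling timeout
                .build();

        client.subscribe("long-running-step")   // topic of the external service task
              .lockDuration(20 * 60 * 1000)     // per-task lock: 20 minutes
              .handler((task, taskService) -> {
                  // ... do the actual long-running work here ...
                  taskService.complete(task);   // or taskService.handleFailure(...)
              })
              .open();
    }
}
```

On the BPMN side, the corresponding service task would be configured with camunda:type="external" and camunda:topic="long-running-step".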

@StephanHaarmann Can you share an example of an external task worker?

You can find documentation on external task workers here:

Examples are available on GitHub, for instance