We have a workflow with a long-running asynchronous system task that makes an HTTP call to an external server. The external HTTP endpoint is stubbed to respond after a few minutes. While the process instance is executing the system task, we restart the Camunda engine and check whether the process instance recovers. Our observation is that the process instance does not move forward and is left in a hung state. Any pointers on how to recover the process instance? We deploy Camunda as a container with PostgreSQL as the database.
Attaching the sample workflow we have used for testing.
long-running-service-task.bpmn (5.6 KB)
When the engine restarts, do you see a new connection being made to your external server? (You would need to monitor this through network traffic or from your external server's side.)
After the engine restart, no new connections are made to the external server.
How did you deploy the process definition? If you used the REST API or deployed from the Modeler, then read on…
There is an engine setting called deployment aware. If it is set to true, the job executor will only execute jobs belonging to deployments registered with its node. If an app is deployed via the container, each deployed app gets re-registered with the engine on startup. For process definitions deployed via the REST API, there is no such registration mechanism on startup, so the job executor will not pick up their jobs…
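If you happen to be running the Camunda 7 Spring Boot starter, the setting can be switched off via configuration. A minimal sketch, assuming the standard starter property names (adjust for your distribution, e.g. `bpm-platform.xml` for the shared-engine setup):

```yaml
# application.yaml — disable deployment-aware job acquisition so the
# job executor also picks up jobs from process definitions that were
# deployed over the REST API rather than with the application.
camunda:
  bpm:
    job-execution:
      deployment-aware: false
```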
If this could be your problem, change the deployment-aware setting to false and see if that fixes your immediate problem.
Thanks. We are adding the workflows via the REST API. The deployment-aware setting is already set to false. Our observation is that only workflows with long-running HTTP calls are not resuming after the restart.
@Mahesh_Doraiswamy can you try your same connection using Jsoup: Replacing Http-Connector with Jsoup usage
Implement the timeout feature as well, as shown in the example. Just checking whether this is a known issue with the HTTP connector and timeouts.
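The linked post suggests Jsoup; the same timeout idea can be illustrated with the JDK's own `HttpURLConnection`. A minimal sketch (the class name, endpoint URL, and timeout value are hypothetical, not from the thread): without explicit connect/read timeouts, a stalled remote endpoint can block the calling thread indefinitely, which is what a hung service task looks like.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class TimeoutCall {
    // Opens a connection with explicit timeouts so a stalled remote
    // endpoint fails fast instead of blocking the job thread forever.
    public static HttpURLConnection open(String endpoint, int timeoutMs) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setConnectTimeout(timeoutMs); // fail if the TCP connect stalls
        conn.setReadTimeout(timeoutMs);    // fail if the server accepts but never responds
        return conn;
    }
}
```

When the call times out, the delegate throws, the job fails with retries, and the job executor retries it later instead of holding the thread.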
This issue is sorted out now. It takes around 5 minutes after a Camunda restart for the process to resume. We simply didn't wait long enough during testing. Thanks for all the inputs.
Did you identify why it takes 5 minutes?
Perhaps the server was stopped just after the job executor had locked the jobs but before they completed, so the jobs still held valid leases in the database. If the server is restarted straight away, those leases are still valid, and the new job executor instance will not acquire the jobs until the leases expire. Given that the default lease time is 5 minutes, a delay of about 5 minutes seems about right…
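One way to confirm the lease theory right after a restart is to look at the job table directly. A sketch against the standard Camunda 7 PostgreSQL schema (table `ACT_RU_JOB` and its lock columns):

```sql
-- Jobs whose lock (lease) has not yet expired will not be re-acquired
-- by a freshly started job executor until LOCK_EXP_TIME_ passes.
SELECT id_, lock_owner_, lock_exp_time_, retries_
FROM act_ru_job
WHERE lock_exp_time_ > now();
```

If rows show up here with a `lock_exp_time_` a few minutes in the future, the delay is just the lease running out; lowering the job executor's `lockTimeInMillis` would shorten the window, at the cost of more frequent lock renewals for genuinely long-running jobs.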