Hi, I’m in the process of testing the clustering setup, specifically failover scenarios where a process is started on one engine (instance A) and that engine is taken down before the process completes. My expectation is that the process engine on instance B would pick up the job and complete it. However, I’m not seeing this happen, and I’m not certain whether I have a design issue within the process definition or have misconfigured my process engines. My understanding is that for a homogeneous clustering setup against a single database, there isn’t much configuration required other than pointing each instance at that DB, so I’m leaning more toward a process definition error.
My process definition sends an http-connector request, receives the data, and then uses a timer boundary event to wait a PT01 duration before progressing. For my failover testing, I’m killing instance A right after the http-connector response is received, to test that the timer job will be picked up on failover to instance B and the process will continue.
Any help here would be great. I’ve attached my BPMN if that helps.
Thanks,
Dominic
AppraisalProcess.bpmn (12.8 KB)
Hi Dominic,
Your understanding of the cluster setup is correct: Camunda nodes just need to share a common database, as process state is persisted in the database.
And this is a subtle point which needs to be understood. If you think about process state, a process transitions from state to state. A Camunda node will load the state of a process from the DB, execute the transitions until the next checkpoint, then persist the state back to the database. Thus if execution is killed before the state is flushed to the DB, that transition will be lost.
In Camunda you have fine-grained control over when the engine flushes process state to the DB. This is controlled by the async before/after flags: an async continuation forces a flush of process state to the DB. In addition, some BPMN constructs (e.g. a Receive Task) implicitly force a flush of process state to the DB.
Hence in your case, I suspect you are starting the process instance and killing the node before any flush to the DB occurs. A process instance start does not necessarily mean it has been persisted to the DB.
If you change your process model so that the Order Appraisal service task sets the Async Before flag, process state will be flushed to the DB before the service call. Thus an alternate node will be able to resume the process.
Also note that the non-interrupting timer will not be created until you actually enter the receive task. It’s a boundary event on the receive task, not on the process.
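To illustrate, here is a minimal sketch of what that looks like in the BPMN XML. This is not your actual model: the ids, names and timer duration are assumptions, and namespace declarations are omitted for brevity.

```xml
<!-- Sketch only: ids, names and the timer duration are assumed -->
<serviceTask id="orderAppraisal" name="Order Appraisal"
             camunda:asyncBefore="true">
  <!-- your existing http-connector configuration stays as-is -->
</serviceTask>

<sequenceFlow id="toWait" sourceRef="orderAppraisal" targetRef="waitForResult"/>

<receiveTask id="waitForResult" name="Wait for appraisal result"/>

<!-- cancelActivity="false" makes the timer non-interrupting; the timer job
     is only created once the receive task is entered -->
<boundaryEvent id="timeout" attachedToRef="waitForResult" cancelActivity="false">
  <timerEventDefinition>
    <timeDuration>PT1M</timeDuration> <!-- illustrative duration -->
  </timerEventDefinition>
</boundaryEvent>
```

With asyncBefore set, the engine creates a job and flushes state before executing the service call, so after a crash the job executor on the surviving node can pick the job up.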
regards
Rob
Rob,
Thanks for the quick reply. I made the change to my BPMN to set the Async Before flag on the Order Appraisal step, and I am still seeing the issue I described above. Are there any queries I can use to look at the DB, so I can send the results your way for review?
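For example, would something along these lines against the standard ACT_RU_* runtime tables be the right place to look (column names as in the Camunda 7 schema)?

```sql
-- Jobs the engine currently knows about (timers, async continuations),
-- including which node, if any, holds a lock on them
SELECT ID_, TYPE_, DUEDATE_, RETRIES_, LOCK_OWNER_, LOCK_EXP_TIME_
FROM ACT_RU_JOB;

-- Active executions, i.e. process state that has actually been flushed
SELECT ID_, PROC_INST_ID_, ACT_ID_, IS_ACTIVE_
FROM ACT_RU_EXECUTION;
```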
Thanks,
Dominic
Hi Dominic,
Perhaps a few more details then:
Does the process complete end to end in the cluster under normal circumstances?
How do you actually ‘Kill instance A’?
How do you know that you killed instance A at just the right moment?
Is your database on a separate host to your two engine nodes?
The behaviour you are after is actually part of the job executor system. Hence, are you sure you have the job executor configured and started on both nodes?
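For reference, on a shared-engine distribution the job executor is configured in bpm-platform.xml, roughly as below. The names shown are the defaults and may differ in your setup.

```xml
<bpm-platform xmlns="http://www.camunda.org/schema/1.0/BpmPlatform">
  <job-executor>
    <job-acquisition name="default"/>
  </job-executor>
  <process-engine name="default">
    <job-acquisition>default</job-acquisition>
    <!-- ... datasource and other engine settings ... -->
    <properties>
      <!-- the job executor must actually be started on each node -->
      <property name="jobExecutorActivate">true</property>
    </properties>
  </process-engine>
</bpm-platform>
```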
regards
Rob
Rob,
I found the issue. I had the setup for a heterogeneous cluster when it should have been a homogeneous one. I made the change to set jobExecutorDeploymentAware to false and it started working. It looks like, out of the box, the deployment is set up as heterogeneous (deployment-aware), but I had thought that the default setup would be otherwise.
Added <property name="jobExecutorDeploymentAware">false</property> and now it works.
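For anyone who finds this later, the property goes in the properties block of the process-engine section in bpm-platform.xml (engine name here is assumed):

```xml
<process-engine name="default">
  <!-- ... existing configuration ... -->
  <properties>
    <!-- false = homogeneous cluster: any node may execute any job -->
    <property name="jobExecutorDeploymentAware">false</property>
  </properties>
</process-engine>
```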
Thanks,
Dominic