Process deployment in clusters

Hello, we’re going to build an application based on Camunda that will run on multiple clusters. As the application is business critical and must be accessible 24/7, the deployment scenario will be something like this.
Let’s say we will have 3 clusters.

  1. Deploy the new application to cluster 1; the applications on clusters 2 and 3 stay alive.
  2. Deploy the new application to cluster 2; the applications on clusters 1 and 3 stay alive.
  3. Deploy the new application to cluster 3; the applications on clusters 1 and 2 stay alive.

As a consequence, new versions of the processes can be deployed only after all clusters contain the same, newest version of the application.
As we don’t have previous experience with Camunda in a production environment, I’d like to ask what mechanism would be best for process deployment. If possible, we’d like to avoid any REST calls in the deployment process.
Two possible mechanisms come to mind.

  1. Implement application version discovery in the environment (e.g. network discovery or a table in the DB); if all existing applications are at the newest version, i.e. the same version as the application currently being deployed, deploy the processes contained in the jar.
  2. After all clusters are deployed, run one more application that deploys the processes and immediately shuts itself down.
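
To make option 1 a bit more concrete, the check could look roughly like the sketch below. It assumes each node writes its application version into some shared registry (for example a DB table) and that the registry can be read as a map of node name to version. The class name `DeploymentGate` and the method `mayDeployProcesses` are made up for illustration and are not part of Camunda.

```java
import java.util.Map;

// Hypothetical helper, not a Camunda API: decides whether the node that is
// starting up may deploy the processes bundled in its jar.
final class DeploymentGate {

    private DeploymentGate() {
    }

    /**
     * @param nodeVersions version reported by each live node, e.g. read from a
     *                     shared DB table (an assumption of this sketch)
     * @param myVersion    version of the application currently being deployed
     * @return true only if every known node already runs myVersion
     */
    static boolean mayDeployProcesses(Map<String, String> nodeVersions, String myVersion) {
        if (nodeVersions.isEmpty()) {
            // No registry data at all: stay on the safe side and do not deploy.
            return false;
        }
        return nodeVersions.values().stream().allMatch(myVersion::equals);
    }
}
```

With three clusters, only the last node to be upgraded would see all entries at the new version and would therefore be the one that deploys the processes.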

Is either of these mechanisms any good? Are there other possible mechanisms?


This is a question many have struggled with, my own company included.

While the clustering model (basically a shared back end database) makes scaling relatively easy, it does present a problem with deployment and version control. The most common problem is that of class loading.

For example, say you have a 3 server cluster (we use WildFly, so that’s what I’m going to use) and you deploy process “Invoice” with its attendant Java classes to server 1. While this is going on, a request comes in to start the Invoice process, and Camunda begins to execute the workflow. One task in the workflow requires a Java class, but it is executed on server #2, to which you have not yet deployed the classes. Camunda is perfectly happy to execute it there because the BPMN code representing the workflow is available to all servers through the shared database. However, it fails because that WildFly server doesn’t have the necessary classes.

You can mitigate some of this through the processes.xml file and the deployment-aware job executor (the `jobExecutorDeploymentAware` engine setting), so that jobs won’t execute on servers where the workflow has not been formally deployed. However, that limits both your redundancy and scaling unless you want to impose a complicated management scheme on top of it.

Camunda have suggested using error handling to catch class not found errors and have them retried.
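
A minimal sketch of that retry idea in plain Java follows. It is not Camunda’s own mechanism: `ClassLoadingRetry`, `callWithRetry` and `isClassLoadingProblem` are hypothetical names, and the sketch assumes the engine may wrap the original error in its own exception, so it walks the cause chain rather than checking only the top-level type.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: retries a task only when the failure looks like a
// "class not yet deployed on this node" problem.
final class ClassLoadingRetry {

    private ClassLoadingRetry() {
    }

    /** Walks the cause chain looking for a class loading failure. */
    static boolean isClassLoadingProblem(Throwable error) {
        for (Throwable cause = error; cause != null; cause = cause.getCause()) {
            if (cause instanceof ClassNotFoundException || cause instanceof NoClassDefFoundError) {
                return true;
            }
        }
        return false;
    }

    /** Runs the task up to maxAttempts times, backing off a little longer after each failure. */
    static <T> T callWithRetry(Callable<T> task, int maxAttempts, long backoffMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (Exception | LinkageError e) {
                // Anything that is not a class loading problem is rethrown immediately.
                if (!isClassLoadingProblem(e) || attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(backoffMillis * attempt);
            }
        }
    }
}
```

The retry only helps if the missing classes do arrive on that node within the retry window, so the attempt count and backoff would need to be matched to how long a rolling deployment takes.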

You don’t really have version control issues if you only execute BPMN code (i.e. no external Java classes), because it becomes available instantly on all servers. So if your request comes in to server #1 and its execution is distributed to all three servers, it will use a consistent version of the BPMN code everywhere.

If you want more tightly controlled behavior, then you probably need your client to specify an explicit version. That has clear overhead, but it means you can control what is being executed. You can deploy a new version with all its classes or whatever, make sure it works, then tell the client to invoke version Y rather than version X.
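
As a rough sketch of what “the client specifies the version” could look like, the snippet below keeps a pinned version per process key on the client side. `VersionedStarter` and `PinnedVersionClient` are hypothetical names. With Camunda, the starter could be backed by a process definition query filtered by key and version, followed by `RuntimeService.startProcessInstanceById`, but that wiring is left out here so the example stays self-contained.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical port to the engine. A Camunda-backed implementation could look up
// the process definition by key and version and start it by its id.
interface VersionedStarter {
    String start(String processKey, int version);
}

// Hypothetical client-side helper that always starts an explicitly pinned version.
final class PinnedVersionClient {

    private final Map<String, Integer> pinnedVersions = new ConcurrentHashMap<>();
    private final VersionedStarter starter;

    PinnedVersionClient(VersionedStarter starter) {
        this.starter = starter;
    }

    /** Switches the client to another version, e.g. from X to Y once Y is verified. */
    void pin(String processKey, int version) {
        pinnedVersions.put(processKey, version);
    }

    /** Starts the pinned version; refuses to fall back to "latest" implicitly. */
    String start(String processKey) {
        Integer version = pinnedVersions.get(processKey);
        if (version == null) {
            throw new IllegalStateException("No version pinned for process " + processKey);
        }
        return starter.start(processKey, version);
    }
}
```

The point of the design is that deploying version Y changes nothing for callers until someone calls `pin("invoice", Y)`, which is the “tell the client to invoke version Y” step.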

We’ve taken a different route. We’re putting an intelligent message buffering system in front of Camunda so that we can pause inbound requests during deployment. If you have a large number of servers, this might be something to consider.
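
The pause-and-release idea can be sketched in a few lines. This is a toy in-memory illustration with made-up names (`PausableBuffer`, `submit`, `pause`, `resume`), not a description of the system mentioned above; a real setup would more likely sit on a durable message broker.

```java
import java.util.ArrayDeque;
import java.util.Objects;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical in-memory buffer: forwards requests directly while running,
// holds them back while paused, and replays them in order on resume.
final class PausableBuffer<T> {

    private final Queue<T> held = new ArrayDeque<>();
    private final Consumer<T> downstream;
    private boolean paused;

    PausableBuffer(Consumer<T> downstream) {
        this.downstream = Objects.requireNonNull(downstream);
    }

    synchronized void submit(T message) {
        Objects.requireNonNull(message);
        if (paused) {
            held.add(message);          // deployment in progress: keep for later
        } else {
            downstream.accept(message); // normal operation: pass straight through
        }
    }

    synchronized void pause() {
        paused = true;
    }

    synchronized void resume() {
        paused = false;
        T message;
        while ((message = held.poll()) != null) {
            downstream.accept(message); // replay in arrival order
        }
    }

    synchronized int pending() {
        return held.size();
    }
}
```

A deployment script would call `pause()` before touching the first server and `resume()` once all servers carry the new classes, so no request is started against a half-deployed cluster.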