Process Start Failures in Clustered Environments for Updated Workflows

We have a “clustered” environment (Camunda 7.5.3-ee/WildFly 10.0.0-FINAL using a shared MySQL database). A load balancer takes incoming requests and distributes them across all nodes.

Because all nodes share the same datasource, they are all “willing” to execute a start for a specific workflow. However, this workflow has custom classes. The initial deployment (with deployment-aware set in processes.xml) puts the custom classes on every node.

We then update the process and add or modify existing classes. We must now deploy the updated workflow to all nodes. The incoming messages do not stop and are still distributed across all nodes. However, when the deployment takes place, it proceeds serially through each of the nodes. This causes a problem because after deployment to the first node, the BPMN code stored in the database becomes the “new” version. So if a start request is received by the second node, which may not yet have completed the deployment, it will fail because the necessary classes are not yet present. These requests are effectively “dropped” and so they are lost.

Several solutions suggest themselves:

  • Stop sending start requests to nodes where the updated workflow is not yet deployed, and allow requests to reach additional nodes as the updated deployment completes on each of them
  • Force all clients to specify a process version (not really an option; a sketch of what that would look like follows this list)
  • Buffer all inbound requests for that workflow until the deployments complete
  • Perform parallel deployments to all nodes
  • Deploy the updated classes separately from the BPMN code, as global modules within WildFly on all nodes, prior to deploying the updated BPMN code that would call them.
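For illustration, the second option would look roughly like this with the Camunda Java API, although forcing every client to do this isn’t realistic for us. This is only a minimal sketch; the process key “orderProcess” and the version number are placeholders:

```java
import org.camunda.bpm.engine.ProcessEngine;
import org.camunda.bpm.engine.repository.ProcessDefinition;

// Sketch of a version-pinned start: the client looks up one specific process
// definition version instead of starting the latest version by key.
// "orderProcess" and version 3 are placeholder values for illustration.
public class VersionPinnedStart {

  public static void startSpecificVersion(ProcessEngine engine) {
    ProcessDefinition definition = engine.getRepositoryService()
        .createProcessDefinitionQuery()
        .processDefinitionKey("orderProcess")
        .processDefinitionVersion(3)
        .singleResult();

    // Starting by id pins the instance to that exact version,
    // regardless of which deployment a given node has seen last.
    engine.getRuntimeService().startProcessInstanceById(definition.getId());
  }
}
```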

I was wondering if you have any “best practices” here that we should consider. The bottom line is, we cannot lose any inbound request to the platform.

Thanks.

Hi @mppfor_manu,

If I understand you correctly, taking the node that is currently being updated out of the load balancer would solve the problem, or am I missing something? If you do that, you just have to make sure that requests to the already updated nodes don’t make the data in the database inconsistent.

One more question: are you worried about REST requests here, or about something that you have to display to the user?

Cheers,
Askar.

No, that’s not the problem. One challenge in a clustered environment is making sure every node always has every custom Java class. If a node executes a task that requires a class that it does not have, it will fail.

The burden of dealing with this is on the developer, who must trap the error and somehow deal with it. If you simply retry, you could hit any number of nodes that don’t have the class yet. Imagine you have 5 nodes. You update the workflow with a war file containing new classes. The update is applied to node #1. Immediately, tasks from that workflow could be executed by nodes 2 through 5, even though the new class is not available on them.

If you do a WildFly deployment “replace”, the class disappears entirely from the node. If you do an update containing the same class names, but with updated code in the class, then a class not found error will not occur, but it will still fail on nodes that don’t have the updated class.

This issue applies primarily to updates, because for an initial deployment you can block execution of new code on nodes where the classes don’t exist by making it deployment aware.

Now, if you really wanted to “hot deploy”, you would need to split the custom classes from the BPMN code in your deployment and you would need to call specific versions of the classes that were tied to specific versions of your BPMN code. I’m not sure how this would actually work, but you would deploy all the classes as modules first, then deploy the BPMN code. This means that as each version of BPMN code is deployed, it would call a very specific version of classes that had already been deployed. This also allows you to run versions of the BPMN code against specific versions of classes.

Again, I’ve not even looked into whether this is possible or not.
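Just to make the idea concrete, here is a rough sketch of what version-specific classes could look like if the delegates were split out and each BPMN version referenced its own package. The package and class names are hypothetical:

```java
// Hypothetical sketch: each workflow version references delegates in its own
// versioned package, e.g. the v2 BPMN model would use
//   camunda:class="com.example.order.v2.ChargeCustomerDelegate"
// while running v1 instances keep using com.example.order.v1.ChargeCustomerDelegate.
package com.example.order.v2;

import org.camunda.bpm.engine.delegate.DelegateExecution;
import org.camunda.bpm.engine.delegate.JavaDelegate;

public class ChargeCustomerDelegate implements JavaDelegate {

  @Override
  public void execute(DelegateExecution execution) throws Exception {
    // v2 behaviour goes here; the v1 class stays deployed and untouched,
    // so existing process instances of the old version are not affected
  }
}
```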

Hi @mppfor_manu,

Can we go step by step over the process that you describe here?

If you do a WildFly deployment “replace”, the class disappears entirely from the node. If you do an update containing the same class names, but with updated code in the class, then a class not found error will not occur, but it will still fail on nodes that don’t have the updated class.

So you have nodes A and B in a cluster with a load balancer in front of them. What is the sequence of your actions?

  1. Choose the node you want to update - let’s start with A
  2. take it out of the load balancer
  3. update node A
  4. allow the load balancer to send requests to node A
  5. take node B out of the load balancer
  6. update node B
  7. allow the load balancer to send requests to node B

Is that what you are doing? Steps 4 and 5 probably have to happen simultaneously.

Cheers,
Askar.

Taking the node out of the load balancer just means it can’t receive a new request from an external client, whatever that might be. Simultaneous deployment isn’t really something you can achieve in the real world because you cannot control when the job executor will pick up a job to run.

The problem is, unless you completely shut down the Camunda engine on the node, it is still asking the database for jobs to execute. Because the workflow was deployed to it at least once, could it not then ask for a job that contains a service task requiring a custom Java class that is not loaded on it yet?
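For what it’s worth, my understanding of the deployment-aware behaviour is roughly the sketch below, using the ManagementService. This assumes deployment-aware job execution is enabled, and the deployment id is a placeholder:

```java
import java.util.Set;

import org.camunda.bpm.engine.ManagementService;
import org.camunda.bpm.engine.ProcessEngine;

// Sketch: with a deployment-aware job executor, each node only acquires jobs
// for deployments registered on that node. "someDeploymentId" is a placeholder.
public class NodeDeploymentRegistration {

  public static void inspectAndControl(ProcessEngine engine) {
    ManagementService managementService = engine.getManagementService();

    // Deployments this node's job executor will acquire jobs for
    Set<String> registered = managementService.getRegisteredDeployments();
    System.out.println("Registered deployments on this node: " + registered);

    // If the classes for a deployment are not yet present on this node,
    // unregistering it stops the local job executor from acquiring its jobs
    managementService.unregisterDeploymentForJobExecutor("someDeploymentId");

    // Once the updated war (and its classes) has landed on this node,
    // register the deployment again so jobs are picked up here
    managementService.registerDeploymentForJobExecutor("someDeploymentId");
  }
}
```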

I’ve confirmed that if you deploy the workflow to nodes A and B, and send a start request to both nodes, it works as expected. However, if you update a custom Java class in the workflow, deploy it only to node A, and then send a start request to both nodes, they will both attempt to execute it, but only node A will execute successfully. Node B will fail because it does not have the updated class.

In WildFly, you can do a hot redeployment of a workflow. The custom Java class is replaced by whatever is included in your updated war file for ALL tasks that use the class. This means long-running processes might start by using version 1 of the class and complete using version 2. That’s because Camunda (or maybe it’s WildFly) only makes one “global” instance of a class available. It does not maintain workflow-version-specific relationships within a deployment (i.e., if you deploy workflow war file version 1 with Java classes, then when you deploy workflow war file version 2, all processes, both versions 1 and 2, will use ONLY the new class).

The issue here is that the BPMN code is persisted in the shared database, making any changes immediately available to all nodes. Any associated Java classes are not shared, and therefore you run the risk of executing a task on a node that doesn’t have the required class.

I could be completely wrong about all of this, but my testing shows it to be true.

Michael

Hi Michael,

you can control the execution of jobs by suspending job execution for a certain time, e.g. for the duration of your deployment. See here for the concept: https://docs.camunda.org/manual/7.5/user-guide/process-engine/process-engine-concepts/#suspend-and-activate-job-execution

If you suspend all job definitions before you start the deployment of a new version of the war, no process instance will execute further. If you mark the start events of your processes with async after, new process instances can even be started, but they will stop right after the start event and won’t execute any Java code from the service tasks.

After the deployment is completed on all nodes of the cluster, you can activate the job definitions again, and the jobs get executed with the deployed version of the Java classes.
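Roughly, the suspend/activate part could look like this via the ManagementService. This is only a sketch; the process definition key “orderProcess” is just an example:

```java
import java.util.List;

import org.camunda.bpm.engine.ManagementService;
import org.camunda.bpm.engine.ProcessEngine;
import org.camunda.bpm.engine.management.JobDefinition;

// Sketch: suspend all job definitions of a process before rolling out the new
// war to the cluster, then activate them again afterwards.
public class DeploymentWindow {

  public static void suspendJobs(ProcessEngine engine) {
    ManagementService managementService = engine.getManagementService();
    List<JobDefinition> jobDefinitions = managementService.createJobDefinitionQuery()
        .processDefinitionKey("orderProcess")
        .list();
    for (JobDefinition jobDefinition : jobDefinitions) {
      // second argument also suspends already existing jobs,
      // third argument (null) means "suspend immediately"
      managementService.suspendJobDefinitionById(jobDefinition.getId(), true, null);
    }
  }

  public static void activateJobs(ProcessEngine engine) {
    ManagementService managementService = engine.getManagementService();
    List<JobDefinition> jobDefinitions = managementService.createJobDefinitionQuery()
        .processDefinitionKey("orderProcess")
        .list();
    for (JobDefinition jobDefinition : jobDefinitions) {
      managementService.activateJobDefinitionById(jobDefinition.getId(), true, null);
    }
  }
}
```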

BTW, it’s the JVM’s concept of classloader namespaces that is responsible for having only one version of a class available in a classloader. Java EE servers like WildFly mitigate this: a class can exist only once inside an application, as every application gets its own classloader.

Hope this helps, Ingo

Your suggestion has merit, but it does not address the issue of how a particular process task would execute if the new Java class were incompatible with it. You might say that backward compatibility of classes is the responsibility of the developer, so that no matter what version is deployed, existing process instances will execute properly. That would certainly be fair.

However, even with this suggestion, you will still have the issue of the Java annotation “@ProcessApplication” conflicting with existing instances. We know we can make this go away by removing that annotation and using the ejb3 client. However, our developers are using the interface exposed by the annotation to implement activity monitoring without the need to put those monitors (logging) in workflow execution listeners. In other words, if we deploy with our current structure, we cannot redeploy without completely removing the existing deployment.
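To make clear what we would lose, the kind of monitoring hook our developers rely on looks roughly like this. It’s a simplified sketch (class name, logging, and some EJB lifecycle details are placeholders or trimmed):

```java
import java.util.logging.Logger;

import javax.annotation.PostConstruct;
import javax.annotation.PreDestroy;
import javax.ejb.Singleton;
import javax.ejb.Startup;

import org.camunda.bpm.application.ProcessApplication;
import org.camunda.bpm.application.impl.EjbProcessApplication;
import org.camunda.bpm.engine.delegate.ExecutionListener;

// Simplified sketch of an EJB process application that uses the callback
// exposed via @ProcessApplication for activity monitoring.
@ProcessApplication
@Singleton
@Startup
public class MonitoringProcessApplication extends EjbProcessApplication {

  private static final Logger LOG =
      Logger.getLogger(MonitoringProcessApplication.class.getName());

  @PostConstruct
  public void start() {
    deploy();
  }

  @PreDestroy
  public void stop() {
    undeploy();
  }

  @Override
  public ExecutionListener getExecutionListener() {
    // Invoked for every execution event of every process deployed by this
    // application, so activity monitoring/logging does not have to be added
    // to each BPMN model as an execution listener.
    return execution -> LOG.info(
        execution.getCurrentActivityId() + " : " + execution.getEventName());
  }
}
```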

But, we’ve been over this before. I think the problem here is twofold:

  • You cannot easily maintain a separate set of Java classes associated with a specific workflow version
  • We have used the “@ProcessApplication” Java annotation in a manner that prevents redeployment of workflows

The first one will be difficult to resolve if that’s even possible. The second one is mostly our problem.