Cluster configuration with dedicated job executor nodes

Hi guys,

I’m interested in setting up a cluster environment: a homogeneous Camunda BPM cluster in which some nodes are dedicated to online transactions (e.g. serving web requests for user tasks and the task list) and some nodes are dedicated to batch processing (execution of async service tasks). My plan is to tune the batch servers for maximum throughput: I use Hystrix to implement the bulkhead pattern and limit the thread pool sizes of all components that communicate with external systems. I plan to give the batch nodes a job executor thread pool of about 50 threads (instead of the default 10).
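Roughly, the per-node split I have in mind looks like the sketch below. I’m using the embedded Java configuration API just for illustration; in the actual WildFly / shared-engine setup this would live in the container’s job executor configuration instead, and the class and method names here are made up:

```java
import org.camunda.bpm.engine.impl.cfg.ProcessEngineConfigurationImpl;
import org.camunda.bpm.engine.impl.jobexecutor.DefaultJobExecutor;

public class NodeConfigSketch {

  // Batch node: large thread pool, job executor active.
  static void configureBatchNode(ProcessEngineConfigurationImpl config) {
    DefaultJobExecutor jobExecutor = new DefaultJobExecutor();
    jobExecutor.setCorePoolSize(50);           // ~50 worker threads instead of the default max of 10
    jobExecutor.setMaxPoolSize(50);
    jobExecutor.setQueueSize(25);              // local queue in front of the pool
    jobExecutor.setMaxJobsPerAcquisition(10);  // the "fetch size" per acquisition cycle
    config.setJobExecutor(jobExecutor);
    config.setJobExecutorActivate(true);
  }

  // Online node: job executor simply not activated, so the node only serves
  // web/API requests and never picks up async jobs.
  static void configureOnlineNode(ProcessEngineConfigurationImpl config) {
    config.setJobExecutorActivate(false);
  }
}
```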

I’ve read the documentation regarding job executor configuration and have several questions about it:

  • Does it make sense to balance the nodes so extremely (no batch at all / all batch)?
  • Can I just disable job executor on the online-processing nodes?
  • How big should the fetch size of the job executor be?

Kind regards,

Simon


We are running homogeneous clusters on WildFly 10. If you are planning for all nodes to use a common back-end database, then be aware of the following (I realize this may already be obvious to you). Please note that what follows may be specific to WildFly with a shared engine and not applicable to your environment. Also note that my relative lack of experience in Java may result in ignorant statements.

Deployment of BPMN code that requires custom Java classes can be problematic, because any node in the cluster could execute any task requiring those classes. If a class does not exist in the node’s Java container, the task will fail. To be sure, this can be mitigated somewhat by making the job executor deployment aware (setting the jobExecutorDeploymentAware property to true in the process engine configuration), which will keep job execution that requires specific classes off of servers that do not have those classes.
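As a rough illustration of what I mean (embedded-engine API shown only for brevity; on the WildFly shared engine the same flag would, as far as I recall, go into the process engine’s properties in standalone.xml):

```java
import org.camunda.bpm.engine.ProcessEngine;
import org.camunda.bpm.engine.impl.cfg.StandaloneInMemProcessEngineConfiguration;

public class DeploymentAwareSketch {
  public static void main(String[] args) {
    // With a deployment-aware job executor, each node only acquires jobs that
    // belong to deployments actually registered on that node, so it never picks
    // up work whose classes it cannot resolve.
    StandaloneInMemProcessEngineConfiguration config =
        new StandaloneInMemProcessEngineConfiguration();
    config.setJobExecutorDeploymentAware(true);
    ProcessEngine engine = config.buildProcessEngine();
    engine.close();
  }
}
```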

You have to remember that the shared database contains the process BPMN and script code, and as such the process definition becomes available to all nodes the moment it is deployed to any one node. However, custom Java classes do not become available on a node until they are actually deployed to it. You could deploy them separately from the BPMN code, but you would then have to control the class versions executed by your BPMN code so that there is consistency in operation.
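To illustrate the distinction, a deployment like the following sketch (the deployment and resource names are made up) puts the BPMN XML into the shared database, where every node sees it immediately, while any delegate classes the model references still have to be installed on each node separately:

```java
import org.camunda.bpm.engine.ProcessEngine;
import org.camunda.bpm.engine.repository.Deployment;

public class SharedDbDeploymentSketch {

  // The BPMN resource ends up in the shared database and is visible to every
  // node at once; the Java classes it references are NOT stored there and must
  // be present on each node's classpath.
  static Deployment deploy(ProcessEngine processEngine) {
    return processEngine.getRepositoryService()
        .createDeployment()
        .name("order-process")                            // made-up name
        .addClasspathResource("order-process.bpmn20.xml") // made-up resource
        .deploy();
  }
}
```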

For example, if you have a long-running process, when it gets to a particular task requiring a custom Java class, the class used will typically be whatever was deployed last. In other words, only one version of the class is available no matter what version of the process you are running. You could mitigate this, in theory, by specifying an actual version of the class to be executed.

If you have more than one node to which requests may be sent and you deploy custom classes, there are very few ways to prevent “class not found” errors. Imagine a scenario where you have three nodes. Because each node can run any task within the process, any node that runs a task requiring a class it does not have will throw a ClassNotFoundException. So your deployment must either occur on all nodes at precisely the same time, or you need to suspend inbound requests while the deployment takes place.

Camunda have suggested that using an “async before” configuration (camunda:asyncBefore="true") at the start of the process would permit you to suspend process execution during deployment: you could suspend the processes while you deploy, then allow them to run again afterwards. You could also, in theory, suspend process execution for the entire process definition. However, if new start requests are received during that suspension, they will be dropped (the engine rejects starts against a suspended definition rather than queueing them).
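For reference, the suspension calls I have in mind look roughly like this (method names from the Camunda Java API; the process definition key and the surrounding class are made up):

```java
import org.camunda.bpm.engine.ProcessEngine;
import org.camunda.bpm.engine.RepositoryService;

public class DeploymentWindowSketch {

  // Rough idea only: pause a definition (and, via the second argument, its
  // running instances) while new classes are rolled out, then resume.
  static void deployWithSuspension(ProcessEngine processEngine) {
    RepositoryService repositoryService = processEngine.getRepositoryService();

    // suspend the definition and its running process instances immediately
    repositoryService.suspendProcessDefinitionByKey("orderProcess", true, null);

    // ... roll the new WAR / classes out to every node here ...

    // reactivate once every node has the new classes on its classpath
    repositoryService.activateProcessDefinitionByKey("orderProcess", true, null);
  }
}
```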

The only real solution that works for me is to use an external buffering mechanism that can hold requests during deployments. You can then deploy updated WAR files to all servers, assuming that any included Java classes are compatible with all process instance versions still running.

I also acknowledge that my ignorance is such that I might have missed an obvious answer here or that I failed to understand your question.

Michael