I would like to ask for advice on handling large numbers of potentially long-running processes. By large numbers, I mean there could be as many as tens of thousands of processes running, each of which is invoking sub-processes or simply waiting for something to happen. This would be for a single cluster of Camunda servers with a shared database.
If I have say 25,000 “parent” processes, in various states from actively invoking a sub-process or executing a task to simply waiting several days for someone to update a form, we could be talking about 100,000 or more individual activities/processes being “in flight” at once. Can Camunda handle that?
Is the Case Management functionality, which I’ve never used or even investigated, something designed to manage these requirements?
The alternative is to push the “state” of each process to an external data store as it completes, then use a separate process to determine whether something needs to be “woken up”, or a new process invoked that pulls state information from the external source.
Any direction on this would be much appreciated.
In my experience, Camunda should have no trouble handling the load you are referring to. I have direct experience with a deployment where, in a peak month, we execute 1 million+ process instances. There are references to other, larger deployments such as this one.
Case management is more of a style than a performance construct. If the paths through your processes can be fully enumerated at design time, use BPMN. If your processes typically require a knowledge worker to assemble and decide which process fragments to run, then CMMN may be the way to go.
You don’t need to push process state to a data store. Persistence is a key capability of the engine and is taken care of out of the box, including ‘waking up’ those process instances that have something to do.
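To make the ‘waking up’ point concrete, here is a minimal, self-contained sketch of the idea behind the engine's job executor handling of timer wait states. This is not Camunda code: the class and method names (`JobTable`, `scheduleTimer`, `acquireDueJobs`) are hypothetical stand-ins, and an in-memory map stands in for the job table the real engine keeps in its database.

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Sketch only: the real engine persists wait states as rows in its
// database; here a map is a stand-in for that job table.
public class JobTable {
    // job id -> due date
    private final Map<String, Instant> jobs = new ConcurrentHashMap<>();

    // Entering a timer wait state just records durable state -- it
    // consumes no thread while the instance waits.
    public void scheduleTimer(String jobId, Instant dueDate) {
        jobs.put(jobId, dueDate);
    }

    // The job executor periodically polls for jobs whose due date has
    // passed and resumes the owning process instances, so no external
    // store or separate wake-up process is needed.
    public List<String> acquireDueJobs(Instant now) {
        List<String> due = jobs.entrySet().stream()
                .filter(e -> !e.getValue().isAfter(now))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
        due.forEach(jobs::remove); // claimed jobs leave the table
        return due;
    }
}
```

The takeaway is that a waiting instance is only persisted state until its due date arrives; the polling loop is what “wakes it up”.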
Firstly, thanks as always Rob for your prompt replies. I appreciate the time you take to help others.
I need to clarify something: the numbers quoted are not cumulative processes run per month; perhaps the example was not clear. I want to know whether, for example, a four-node cluster of Camunda servers (server config: RHEL 6 Linux, 12 CPUs, 32 GB memory, WildFly 10) attached to a single-instance MySQL 5.7.16 (commercial) database could be expected to handle 25,000 to 50,000 simultaneous (i.e. “running”) processes.
In terms of actual processes run per month, we’ll easily reach 2 to 5 million or more. That said, many of those processes will simply initiate a single workflow that takes no action and terminates. Others may run for an extended period of time managing child workflows and tasks (hence the large number of simultaneous processes).
Finally, it seems to me that scaling in Camunda is really limited by the back-end database. Given the job acquisition algorithm Camunda uses, load is automatically spread across all servers, and we can add more servers if necessary.
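The load-spreading mentioned above comes from each node competing to claim jobs from the shared database. Below is a rough, self-contained sketch of that idea, assuming an optimistic compare-and-set claim; the names (`JobAcquisition`, `tryAcquire`) are hypothetical, and a concurrent map stands in for the shared job table.

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: each cluster node tries to claim unowned jobs; an
// optimistic compare-and-set ensures exactly one node wins each job,
// so work spreads across nodes without any central coordinator.
public class JobAcquisition {
    // job id -> lock owner ("" means unclaimed); stand-in for a DB row
    private final ConcurrentHashMap<String, String> lockOwner =
            new ConcurrentHashMap<>();

    public void addJob(String jobId) {
        lockOwner.put(jobId, "");
    }

    // Succeeds only if the job is still unclaimed; a losing node
    // simply moves on to the next available job.
    public boolean tryAcquire(String jobId, String nodeId) {
        return lockOwner.replace(jobId, "", nodeId);
    }

    public String ownerOf(String jobId) {
        return lockOwner.get(jobId);
    }
}
```

Because losers just move on rather than block, adding nodes increases total acquisition throughput until the shared database itself becomes the bottleneck, which matches the observation above.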
There are two parts to performance in Camunda, i.e. to what eats up resources:
- Keeping the state of running process instances (i.e. process persistence)
- Changing the state of running process instances (i.e. process execution)
Point 1 is largely influenced by the database setup. You are correct when you write that scaling is limited by the back-end database.
Point 2 very much depends on the nature of your processes and may be influenced by your application and database server setup.
For example, having 25,000 to 50,000 process instances active at the same time should be perfectly fine if they are all waiting for external interactions (e.g. waiting for a user task to complete), where it is unlikely that all tasks complete in parallel. Waiting for an interaction does not use up application server resources; active processing of the process instance does. So, on the other end of the spectrum, having the same number of process instances performing heavy calculations in service tasks will certainly result in a performance hit. In that case, you could add processing resources (e.g. more application servers), and as you write, the job workload will be distributed fairly evenly. A rule of thumb: the less active processing, the more process instances can be active at the same time without noticeable performance degradation.
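The distinction above can be sketched in a few lines, assuming an in-memory stand-in: a waiting instance is just persisted state (a map entry here, database rows in the real engine), while an active service task occupies a worker thread from a bounded pool. The class name `WaitVsWork` and the numbers are purely illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WaitVsWork {
    // Tens of thousands of waiting instances cost only storage --
    // no threads are consumed while they wait.
    public static int countWaiting(int instances) {
        Map<String, String> waitStates = new HashMap<>();
        for (int i = 0; i < instances; i++) {
            waitStates.put("instance-" + i, "USER_TASK");
        }
        return waitStates.size();
    }

    // Active service tasks, by contrast, compete for a fixed pool of
    // worker threads (like job executor threads), so they are the part
    // that limits how much can happen at once.
    public static int runActive(int tasks, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch done = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            pool.execute(done::countDown); // stand-in for a service task
        }
        done.await();
        pool.shutdown();
        return tasks;
    }
}
```

With 50,000 entries in `countWaiting` but only, say, 8 threads in `runActive`, it is the active work, not the count of waiting instances, that queues up, which is the rule of thumb in code form.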
Excellent information, Thorben. I will share this with my team. It is consistent with what I would have expected, though only some real-world experience will tell.
In any case, the fact that 25,000+ (or whatever the number is) doesn’t cross some critical threshold is good to know.