Job Execution Very Poor in Clustered Camunda on WildFly

Summary:

We experienced spikes in process start requests against a 4-node Camunda cluster that caused job processing to slow substantially. Disk I/O on the MySQL database server maxed out. Shutting down 3 of the 4 Camunda servers caused processing to speed up considerably.

Environment:

  • RedHat Linux 6 running as a VMware VM
  • Camunda 7.6.2-ee
  • WildFly 10.1.0
  • MySQL Enterprise 5.7.16 server running as a VMware VM

Problem Detail:

We are running 4 Camunda servers attached to a single, separate MySQL database server. Here’s what happened:

  • We got several “spikes” in process start requests (highest recorded was 1230 in one minute, but that was not sustained). An earlier incident showed spikes of as many as 3330 start requests per minute.
  • We then started to get large numbers of failures and rollbacks in the WildFly server logs (samples below).
  • The disk I/O on the database server then went to a sustained 100%+.
  • MySQL process lists showed that a large percentage of queries were rollbacks.
  • Active processes continued to climb as actual activity execution slowed.

When we shut down 3 of the 4 Camunda servers, the total active processes began to fall and the warning messages shown below began to disappear.

One change we had made prior to all this was to the Job Executor configuration in WildFly: we significantly increased the thread counts and queue length in an attempt to resolve a separate performance issue. These values are orders of magnitude higher than the defaults. The configuration below is the same on all 4 servers:

    <job-executor>
        <core-threads>96</core-threads>
        <max-threads>192</max-threads>
        <queue-length>384</queue-length>
        <job-acquisitions>
            <job-acquisition name="default">
                <properties>
                    <property name="lockTimeInMillis">300000</property>
                    <property name="waitTimeInMillis">5000</property>
                    <property name="maxJobsPerAcquisition">48</property>
                    <property name="backoffTimeInMillis">90</property>
                    <property name="maxBackoff">450</property>
                </properties>
            </job-acquisition>
        </job-acquisitions>
    </job-executor>

Root Cause Theories:

  • The spikes pushed us over a “threshold” at which the 4 servers were fighting over database resources, and the rollbacks just added extra load that wasn’t processing any activities.
  • We set the Job Executor configuration far too high, so each of those additional threads was putting load on the database server while doing no real work; removing servers decreased that “useless” load and brought us back under the “threshold”.

I should note that when we got our first spike several days ago, we just let things run and it eventually worked its way through the processes.

WildFly Server Log Messages:

2017-04-11 12:42:08,680 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff87592a5c:-bfe638e:58eccdc4:db in state  SCHEDULE_CANCEL
2017-04-11 12:42:08,680 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper Worker 67) ARJUNA012108: CheckedAction::check - atomic action 0:ffff87592a5c:-bfe638e:58eccdc4:da aborting with 1 threads active!
2017-04-11 12:42:08,680 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff87592a5c:-bfe638e:58eccdc4:dc in state  SCHEDULE_CANCEL
2017-04-11 12:42:08,680 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff87592a5c:-bfe638e:58eccdc4:dd in state  SCHEDULE_CANCEL
2017-04-11 12:42:08,680 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff87592a5c:-bfe638e:58eccdc4:de in state  SCHEDULE_CANCEL
2017-04-11 12:42:08,680 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff87592a5c:-bfe638e:58eccdc4:df in state  SCHEDULE_CANCEL
2017-04-11 12:42:08,681 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff87592a5c:-bfe638e:58eccdc4:e0 in state  SCHEDULE_CANCEL
2017-04-11 12:42:08,681 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff87592a5c:-bfe638e:58eccdc4:e1 in state  SCHEDULE_CANCEL
2017-04-11 12:42:08,681 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff87592a5c:-bfe638e:58eccdc4:e2 in state  SCHEDULE_CANCEL
2017-04-11 12:42:08,681 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff87592a5c:-bfe638e:58eccdc4:e3 in state  SCHEDULE_CANCEL

Hi,
If max threads is set to 192 and you have 4 nodes, then you could potentially have 4 * 192 = 768 connections to the database. If that were the case, I would not be surprised if the database were thrashing, etc…

regards

Rob

Are you saying that each of those threads was constantly querying the database, even if no process was allocated to it? Our concern is that process start requests will get rejected somehow if there are not enough threads. We have 4 Camunda servers behind a load balancer.

My perception is that Camunda has a strong focus on human-centered tasks versus pure “touchless” automation. I say that based on the types of examples they provide on GitHub and in their general documentation. This is not a criticism, just my perception. We are focused almost entirely on automation with no human interaction, at very high volumes.

The changes we made were probably the wrong choice, but we were trying to maximize the throughput of the servers. Are there any general guidelines for doing this? Consider the following:

  • The MySQL database server supports a maximum of 5000 connections
  • Each WildFly/Camunda server has a datasource pool with a minimum of 150 connections and a maximum of 300
  • Each WildFly/Camunda server (a VMware guest) has 12 “cores” and plenty of memory
  • The 4 WildFly/Camunda servers and the single MySQL server are on the same subnet
  • The WildFly/Camunda servers are behind a load balancer, so incoming REST requests are spread evenly across them

Our thinking (perhaps in error) was that if a huge number of requests hit while there were already a huge number of active processes, then we needed to allocate more threads, even though in theory only 12 (the total number of CPU cores) could execute at once. Most of our processes take only 5 to 15 seconds to execute and are very active during their lifetime.

Camunda has said that it doesn’t matter (within reason) how many active processes you have in a clustered environment; what matters is what those processes are doing. If they’re all long-running processes that are “asleep”, then it’s not an issue. But if each of them were running a Java class doing some form of sophisticated CPU-intensive calculation, then 100 of them could max out the Camunda server resources.

As always, Rob, your thoughts and recommendations are appreciated. I hope I can return your generosity some day.

Michael

Hi Michael,

For some general reading on cluster tuning, this blog post may be of interest.

I’m not an expert on WildFly configuration; however, I shall speak to some Tomcat concepts which may resonate. On Tomcat there are two distinct thread pools. One is for foreground requests, e.g. REST API calls or Tasklist/Cockpit-initiated calls. The second pool is the job executor thread pool, used for background job execution and asynchronous continuations. Hence your foreground thread pool should be set large enough for the number of concurrent client requests you expect. 3000 requests per minute is only 50 per second, so given you have 4 nodes, I would expect 25 foreground threads per server to be adequate.
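As a rough illustration on the Tomcat distribution (the file locations and numbers here are only an assumed sketch, not a recommendation for your environment): the foreground pool is the HTTP connector, and the background pool is the job executor declared in bpm-platform.xml.

    <!-- conf/server.xml: foreground pool, sized for concurrent client/REST requests -->
    <Connector port="8080" protocol="HTTP/1.1"
               maxThreads="25"
               connectionTimeout="20000" />

    <!-- conf/bpm-platform.xml: background pool, the job executor for async jobs -->
    <job-executor>
        <job-acquisition name="default" />
    </job-executor>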

Now, both thread pools will use a common database connection pool. The number of connections in the connection pool should be as small as possible, but large enough for both thread pools, so I would size it to match the sum of the max threads in each pool.
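Translated into WildFly terms (again, I’m not a WildFly expert, so treat the element names, JNDI name and numbers below as assumptions to verify; they assume roughly 25 foreground plus 30 job executor threads per node), the datasource pool in standalone.xml would look something like:

    <datasource jndi-name="java:jboss/datasources/ProcessEngine" pool-name="ProcessEngine">
        <connection-url>jdbc:mysql://db-host:3306/camunda</connection-url>
        <driver>mysql</driver>
        <pool>
            <!-- about the sum of the two thread pools, not hundreds of connections -->
            <min-pool-size>10</min-pool-size>
            <max-pool-size>55</max-pool-size>
        </pool>
    </datasource>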

How to size the job executor thread pool depends on the workload. If your background tasks are CPU-intensive, keep it low, e.g. twice the number of CPU cores. If you are making lots of network API calls and your threads are thus I/O-bound, then increase the size of the job executor pool to, say, 8 times the number of CPU cores.

Next you need to tune the job acquisition strategy. In particular, if you tend to have a backlog of async jobs, then acquire more jobs per acquisition cycle; perhaps use a number around half the maximum number of job executor threads.
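Putting the last two paragraphs together for your 12-core nodes, and using the same subsystem format as your configuration above, a conservative starting point might look something like this (illustrative numbers only, assuming largely CPU-bound jobs; scale the thread counts up if your jobs are mostly I/O-bound):

    <job-executor>
        <core-threads>12</core-threads>
        <!-- roughly 2x the 12 cores for CPU-bound work -->
        <max-threads>24</max-threads>
        <queue-length>48</queue-length>
        <job-acquisitions>
            <job-acquisition name="default">
                <properties>
                    <!-- roughly half the max job executor threads -->
                    <property name="maxJobsPerAcquisition">12</property>
                </properties>
            </job-acquisition>
        </job-acquisitions>
    </job-executor>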

Numbers like these would be my starting point. Tuning can take a lot of experimentation, but as a rule of thumb, less is typically more.

You also need to ensure the DB tier is tuned for your usage scenario, but that’s another forum…

regards

Rob

The only problem I encountered in running a baseline (smoke) test on a clustered Camunda platform was generating enough inbound events to stress the system, i.e. to cause a system ‘panic’.

In other words, Camunda cluster configuration works.

Under heavy stress, the first evidence of failure was that the cluster started reporting “optimistic locking” exceptions, and that issue was solved.

As far as reliability and fail-over go, the system performed well on an open-source stack. I used NGINX as the load balancer. Don’t forget that Camunda’s web apps require “sticky” sessions, meaning that once a user establishes a session on node ‘A’, their requests keep an affinity to node ‘A’. Synchronizing sessions between the clustered nodes wasn’t considered essential, so on failover an active user’s session simply has to be re-established on the secondary node.
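For illustration, here is a minimal open-source NGINX sketch of that affinity (the host names are placeholders; ip_hash pins each client to one node, while NGINX Plus offers cookie-based stickiness instead):

    upstream camunda_cluster {
        ip_hash;                        # session affinity: same client -> same node
        server camunda-node-a:8080;
        server camunda-node-b:8080;
    }

    server {
        listen 80;
        location / {
            proxy_pass http://camunda_cluster;
            proxy_set_header Host $host;
        }
    }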

Would you mind sharing the general configuration and volume details? I just posted another topic that gives a general overview of ours.

We don’t know exactly where the problem is. We’re going to do more formal testing next week.

I know Camunda itself can handle high volumes, but it’s difficult to quantify what that means. The number of active processes isn’t relevant if they’re all asleep. But if you have 40,000 active processes, all of them wanting to execute their next task (even if the task itself is not particularly CPU-intensive), then what?

Thanks.

Michael