Camunda 7 Tomcat on EC2 Stability Issues

Greetings,

We have deployed two solutions based on the Camunda Platform 7 Apache Tomcat distribution, one for testing and one for production. Both run on Amazon EC2 instances. The testing server uses the internal H2 database, and the production server uses a MySQL database on a completely separate EC2 instance.

Each deployment has a single process engine, and Camunda is used only as a process orchestrator through the REST API service, connecting with a middleware layer to our backend systems.
Additionally, we develop locally using the Spring Boot package.

The process engine configuration in bpm-platform.xml is the default that ships with the Camunda Tomcat distribution:

<job-executor>
  <job-acquisition name="default">
    <properties>
      <property name="maxJobsPerAcquisition">5</property>
      <property name="waitTimeInMillis">8000</property>
      <property name="lockTimeInMillis">400000</property>
    </properties>
  </job-acquisition>
  <properties>
    <!-- Note: the following properties only take effect in a Tomcat environment -->
    <property name="queueSize">3</property>
    <property name="corePoolSize">5</property>
    <property name="maxPoolSize">10</property>
    <property name="keepAliveTime">0</property>
  </properties>
</job-executor>

The workload on the testing server is around 100-200 concurrent process instances/tasks, and the production server is currently running 2700 active instances, most of which consist of user tasks and wait timers for asynchronous management of processes.

In terms of performance, both servers are as fast as expected and the load on the EC2 instances is minimal, with around 500MB of RAM usage on the testing server and 600MB on the production server.

However, we are experiencing stability issues in both environments. At seemingly random intervals Camunda stops responding and then crashes: Camunda is no longer running (requests to the URL and the REST API return 500), while the EC2 instance and its operating system remain active.
I have been trying to trace the cause of the crashes, as we want Camunda running 24/7, especially for our production system. My current stopgap is a scheduled checker in our own backend that queries the Camunda REST API; if it fails to respond 5 times in 10 seconds, it sends an alert and tries to reboot the machine using the

sudo reboot

command.
Camunda has been configured to run on startup as a daemon on both machines. This allows for a relatively fast restart of Camunda after machine maintenance, downtime, restarts and, most often, a Camunda crash.
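
For reference, the "5 failures in 10 seconds" check described above could be sketched roughly as follows. This is a minimal sketch, not our actual backend code: the host, port and `/engine-rest` base path are assumed defaults of the Tomcat distribution, and the names `probe` and `is_down` are made up for illustration.

```python
import time
import urllib.error
import urllib.request

# Assumption: default REST base path of the Tomcat distribution; adjust host and port.
ENGINE_URL = "http://localhost:8080/engine-rest/engine"


def probe(url=ENGINE_URL, timeout=2.0):
    """Return True if the REST API answers with HTTP 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors such as 500, refused connections and timeouts.
        return False


def is_down(probe_fn, attempts=5, window_seconds=10.0, sleep=time.sleep):
    """Run up to `attempts` probes spread over `window_seconds`; True only if all fail."""
    pause = window_seconds / attempts
    for attempt in range(attempts):
        if probe_fn():
            return False  # one successful answer is enough to call the engine alive
        if attempt < attempts - 1:
            sleep(pause)
    return True
```

Calling `is_down(probe)` from a scheduler reproduces the rule; what happens on `True` (alert, restart) is left to the caller. Passing the probe and the sleep function as arguments keeps the decision logic testable without a running server.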

I’m at a complete loss as to why Camunda stops working. This issue has never occurred when working locally with Spring Boot instances, some of which have been running on my own development machine for months at a time, albeit with a much smaller number of active process instances.
It seems that as load has increased, especially in the production environment, the crashes have become more commonplace.

What would be the best way to go about:

  1. Diagnosing the cause of crashes
  2. Improving availability of Camunda, as I understand using a cluster of process-engines is the way to go, but should they all be running in the same EC2 instance and connect to the shared database, or running in different instances?

And finally, could this be an AWS-related issue? Perhaps connection timeouts causing the engine to crash?

When you deploy the code to production, you should not run a single node. Any resources deployed to production should run as a cluster. Here is a blog post that talks about Camunda clustering. Please do a full analysis and set up an environment before moving to production-ready code.

Thanks for the reply,

I read the post, but it doesn’t explain what a node entails. Perhaps I missed it somewhere in the documentation; I tried to find an example of a cluster setup but was unable to.

Whether or not we run a cluster, my main concern is that the Camunda engine crashes and needs to be restarted, as this also affects our test environment.

In the case of the production environment, is each “node” a unique server instance, or can multiple nodes (instances of the Camunda engine, each running on an Apache Tomcat server instance) be run on a single server instance?

I have a couple of updates regarding the situation:

  1. We are currently setting up a 3-node cluster with a load balancer for our production environment, which should alleviate most of our uptime issues.
  2. I have identified in the logs that the JVM is indeed crashing when running Camunda, and the issue appears to be insufficient memory. Since the servers run Camunda exclusively and should have more than enough resources, I’m at a bit of a loss as to the root cause. We run with UseCompressedOops, which is the default and recommended setting. For now we will increase available RAM far beyond what is in use.
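
For anyone doing the same log check, a rough sketch of a scanner for memory-related crash signatures follows. It assumes the logs are plain text files in a single directory (the directory and the function names are hypothetical), and it only searches for two strings: the `java.lang.OutOfMemoryError` exception name, and the wording HotSpot writes at the top of an `hs_err_pid<pid>.log` file when a native allocation fails.

```python
from pathlib import Path

# The first signature is the Java exception name; the second is the wording HotSpot
# uses in hs_err_pid<pid>.log files when the OS cannot provide more native memory.
MEMORY_SIGNATURES = (
    "java.lang.OutOfMemoryError",
    "There is insufficient memory for the Java Runtime Environment",
)


def find_memory_errors(lines):
    """Return (line_number, text) pairs for lines that contain a memory signature."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(signature in line for signature in MEMORY_SIGNATURES):
            hits.append((number, line.rstrip()))
    return hits


def scan_log_directory(directory):
    """Scan every *.log and *.out file in `directory`; map file name to its hits."""
    report = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix in (".log", ".out"):
            with path.open(errors="replace") as handle:
                hits = find_memory_errors(handle)
            if hits:
                report[path.name] = hits
    return report
```

Which signature turns up matters: an `OutOfMemoryError` points at a JVM limit such as the heap size, while the "insufficient memory" message means the operating system could not give the JVM more native memory, which is a question of instance sizing rather than heap settings.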

Hi @EPV

What about your processes and the process variables used - are these objects that require a lot of memory? Perhaps big XML docs, files, images etc? Or could you be doing memory intensive operations?

This might explain the random nature…

Hope that helps.

BR
Michael

I hope you have gone through Performance tuning for all the components of Camunda.
Performance tuning Camunda 7 | Camunda Platform 8 Docs.

Thank you for the response, @mimaom,
No, in our use case Camunda exists only as a process orchestrator and task dispatcher. The only variables held in processes are simple reference strings that point to a database for the backend to work with, and everything decision-related is also handled by REST endpoints.
Long-running asynchronous work, such as heavier and slower jobs that could run for minutes or hours and thus time out, is handled by generating a user task with a unique assignee ID, which the backend then moves forward once the slower off-BPM work is completed.
The most memory/resource-intensive use case is processes with wait timers (e.g. we want a task to appear on October 25, 2023, and the process is started on January 10, 2023), combined with a large number of instances active at a time.

It does still appear that lack of memory is the issue with our stability, so for now we are increasing the amount of memory available to the Camunda servers.
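
To put a number on the wait timers mentioned above, the REST API's job count endpoint can be queried with its `timers` filter. The sketch below assumes the default `/engine-rest` base path on localhost; the function names are made up for illustration. It only counts timer jobs and says nothing on its own about whether they cause the memory pressure.

```python
import json
import urllib.parse
import urllib.request

# Assumption: default REST base path of the Tomcat distribution; adjust host and port.
BASE_URL = "http://localhost:8080/engine-rest"


def job_count_url(base_url=BASE_URL, **filters):
    """Build the GET /job/count URL with optional query filters."""
    query = urllib.parse.urlencode(filters)
    return f"{base_url}/job/count" + (f"?{query}" if query else "")


def parse_count(body):
    """Extract the integer from a response body of the form {"count": n}."""
    return int(json.loads(body)["count"])


def count_timer_jobs(base_url=BASE_URL, timeout=5.0):
    """Ask the engine how many timer jobs currently exist."""
    url = job_count_url(base_url, timers="true")
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_count(response.read())
```

Logging `count_timer_jobs()` periodically alongside the JVM memory usage would show whether the two grow together.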

@cpbpm
Thanks for the response. When we initially set up our testing and prod environments, we used the Small server class from the recommended use cases, with the MySQL database on a third, separate instance, where all our other backend databases also live.
It does appear we need to move to a Medium class server, and also run more than one node.
We are currently setting up a 3-node cluster with a load balancer.
I expect this will be sufficient.

Thank you for the useful links and suggestions.