Greetings,
we have deployed a Camunda Platform 7 Apache Tomcat distribution based solution, one for testing and one for production usage, both run on Amazon EC2 instances, the testing server uses internal H2 database and the production server uses a mySQL database in a completely separate EC2 instance.
Each deployment has a single process engine, and Camunda is used only as a process orchestrator through the REST API service, connecting with a middleware layer to our backend systems.
Additionally, we develop locally using a spring boot package.
The configuration of the process engine in bpm-platform.xml is default as it comes from the Camunda Tomcat distribution.
<job-executor>
<job-acquisition name="default">
<properties>
<property name="maxJobsPerAcquisition">5</property>
<property name="waitTimeInMillis">8000</property>
<property name="lockTimeInMillis">400000</property>
</properties>
</job-acquisition>
<properties>
<!-- Note: the following properties only take effect in a Tomcat environment -->
<property name="queueSize">3</property>
<property name="corePoolSize">5</property>
<property name="maxPoolSize">10</property>
<property name="keepAliveTime">0</property>
</properties>
</job-executor>
The workload in the testing server is around 100-200 concurrent instances of processes/tasks, and the production server is currently running with 2700 active instances, most of which consist of user task and wait timers for asynchronous management of processes.
In terms of performance, both servers are fast as expected, the load on the EC2 instances is minimal, with around 500MB of ram usage for the testing server, and 600MB for the production server.
However, we are experiencing stability issues with both environments, seemingly at random intervals Camunda will stop responding and then subsequently crash, meaning Camunda is no longer running (returning 500 when querying the URL and REST API) but the EC2 instance and its operating system remaining active.
I have been trying to trace the cause of the crashes, as we would want Camunda to be running 24/7, especially for our production system. My current patchwork solution was to schedule a checker in our own backend to query the Camunda REST API, and if it fails to respond 5 times in 10 seconds it will send an alert and try to reboot the machine using the
sudo reboot
command.
Camunda has been configured to run on startup using a Daemon on both machines. This allows for a relatively fast restart of Camunda in event of machine maintenance, downtime, restarts and mostly, because Camunda has crashed.
I’m at a complete loss for the reason as to why Camunda stops working, this is an issue that has never ocurred when working locally with spring boot instances, some of which have been running on my own development machine for months at a time, albeit with a much smaller number of active process instances.
It seems that as load has increased, especially in the production environment, the crashes have become more commonplace.
What would be the best way to go about:
- Diagnosing the cause of crashes
- Improving availability of Camunda, as I understand using a cluster of process-engines is the way to go, but should they all be running in the same EC2 instance and connect to the shared database, or running in different instances?
And finally, could this be an AWS related issue? Perhaps connection timeouts casing the engine to crash?