Problem
Before launching Camunda in production we ran a load test in our staging environment.
30-40 new process instances per second were fine.
But adding more load than that resulted in a growing count of "in-process" tokens (more processes started than finished per unit of time).
Since we have some timers on the BPMN diagram, that backlog can lead to a timer event firing before the expected message is correlated, so a process may finish prematurely, and that's a problem.
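For reference, the backlog can be watched with plain Camunda queries (running process instances and jobs that are due but not yet executed). The snippet below is only an illustrative sketch, not our actual monitoring code: the class name, bean wiring, interval and logging are made up; the Camunda query APIs themselves are standard.

```java
import org.camunda.bpm.engine.ManagementService;
import org.camunda.bpm.engine.RuntimeService;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Logs the size of the backlog every 10 seconds: running process instances plus
// jobs (timers / async continuations) that are due but not yet picked up by the job executor.
// Requires @EnableScheduling somewhere in the Spring configuration.
@Component
public class BacklogSampler {

    private final RuntimeService runtimeService;
    private final ManagementService managementService;

    public BacklogSampler(RuntimeService runtimeService, ManagementService managementService) {
        this.runtimeService = runtimeService;
        this.managementService = managementService;
    }

    @Scheduled(fixedRate = 10_000)
    public void sample() {
        long runningInstances = runtimeService.createProcessInstanceQuery().count();
        long executableJobs = managementService.createJobQuery().executable().count();
        System.out.printf("running instances=%d, executable jobs=%d%n", runningInstances, executableJobs);
    }
}
```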
Setup
Camunda:
7.14 + Spring Boot 2.4.5
BPMN diagram:
mainly several receive tasks and send tasks (the latter with "async before"), plus a few gateways.
Kubernetes:
2 replicas
requests:
  memory: "3Gi"
  cpu: 6000m
limits:
  memory: "4Gi"
  cpu: 8000m
Database:
PostgreSQL 11.5, 4 cores, 8 GB RAM, HDD (probably slow), PgBouncer with a pool of 40 connections to the Camunda DB.
History level:
activity
Settings (job executor):
wait-time-in-millis: 2000
max-wait: 8000
max-jobs-per-acquisition: 50
core-pool-size: 16
max-pool-size: 24
queue-capacity: 400
backoff-time-in-millis: 25
max-backoff: 400
Load testing scenario:
We have two Kafka topics (24 partitions each), one for outgoing and one for incoming messages.
A load-generator app produces "start process" messages to the incoming topic at a given rate (40 rps).
Camunda consumes these messages, starts a new process instance for each one, and produces new messages to the outgoing topic via a Java delegate.
The load generator consumes those messages and immediately replies with a "reaction" message, which mimics the response of an external service. Camunda processes the reaction and correlates it (by business key) to move past the receive task. A rough sketch of the Camunda side is below.
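This is an illustrative sketch, not our real code: the topic names, the process definition key ("loadTestProcess"), the message name ("reactionMessage"), the payload-based routing and the bean names are placeholders. Only the Camunda APIs (RuntimeService, JavaDelegate, message correlation) and Spring Kafka (@KafkaListener, KafkaTemplate) are used as documented.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.camunda.bpm.engine.RuntimeService;
import org.camunda.bpm.engine.delegate.DelegateExecution;
import org.camunda.bpm.engine.delegate.JavaDelegate;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

// Consumes both "start process" and "reaction" messages from the incoming topic.
// The Kafka record key carries the business key in this sketch.
@Component
public class IncomingTopicListener {

    private final RuntimeService runtimeService;

    public IncomingTopicListener(RuntimeService runtimeService) {
        this.runtimeService = runtimeService;
    }

    @KafkaListener(topics = "incoming-topic", groupId = "camunda-app")
    public void onMessage(ConsumerRecord<String, String> record) {
        if (record.value().startsWith("START")) {
            // "Start process" message: start a new instance with the record key as business key.
            runtimeService.startProcessInstanceByKey("loadTestProcess", record.key());
        } else {
            // "Reaction" message: correlate by business key to move past the receive task.
            runtimeService.createMessageCorrelation("reactionMessage")
                    .processInstanceBusinessKey(record.key())
                    .correlate();
        }
    }
}

// Delegate behind the send tasks (marked "async before" in the BPMN),
// referenced e.g. as ${sendDelegate}: publishes a message to the outgoing topic.
@Component("sendDelegate")
class SendDelegate implements JavaDelegate {

    private final KafkaTemplate<String, String> kafkaTemplate;

    SendDelegate(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    @Override
    public void execute(DelegateExecution execution) {
        kafkaTemplate.send("outgoing-topic", execution.getProcessBusinessKey(), "REQUEST");
    }
}
```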
Monitoring
Looking at the Kubernetes dashboard, I found no shortage of memory or CPU in the Camunda app.
But the database connection pool was constantly full of running queries.
I suspect the database is the weakest link in our setup, but maybe I'm missing something?
Should I try to increase the connection pool for Postgres, or tune something else?
Maybe the database HDD is just too slow?
Thanks.