Increasing Paritions Degrades Performance/Throughput

jburwell · October 13, 2023, 3:55pm

We have been benchmarking Camunda performance and scalability using a ten (10)
step workflow with no decision points or loops. Our understanding is that as we
increase the number of partitions to saturate node CPU/memory resources, the overall
throughput of the cluster should increase and the straight through processing time
for each workflow should remain relatively constant. However, our tests have yielded
straight-through performance degradation and relaitvely even throughput as we increase
the number of partitions.

Each step invokes a simple dummy service which echoes the POSTed a ~17k JSON payload
and passes it onto the next step. The test driver is submitting 10 requests/sec to run the workflow.
As we increase the number of partitions across ten(10) dedicated Zeebe nodes to saturate available
CPU and memory, Camunda’s average execution time per workflow significantly deteriorates as follows:

10 partitions: 1.7 seconds (~5% CPU/~15% memory utilized)
20 partitions: 2.1 seconds
30 partitions: 2.9 seconds
40 partitions: 6.3 seconds
120 partitions: 21.3 seconds (~50% of CPU/memory utilized)

All tests were run with a replication factor of 3. We are also not seeing a significant increase in the
number of workflows per second completed as the number of partitions increases. ElasticSearch and
Zeebe node disk and network I/O are low with no indications of being I/O bound.

We have deployed Camunda 8.2.1 to AKS running k8s 1.2.7 with the following
node pools deployed to the EastUS region in the same availability zone:

3x Common Nodes (running Operate, TaskList, and dummy services) (4 vCPU/16 GB RAM)
10x Zeebe/Zeebe Gateway nodes (4 vCPU/16 GB RAM)
3x Elasticsearch nodes (8 vCPUs/32 GB RAM)

All nodes are configured with Temporary kubelet storage, and the Zeebe and
Elasticsearch stateful sets are using the managed-csi-premium storage class.
Finally, the cluster is configured with the Standard SKU for increased master capacity/
reliability, and kubenet networking.

On this cluster, each Zeebe node is running one (1) Zeebe pod and one (1) Zeebe Gateway
pod (in addition to Azure daemonsets for metrics/log collection and CSI). The Zeebe
stateful set is configured with a resource limit and request of 2000m CPU and 5500Mi
memory and the Zeebe Gateway is configured with 400m CPU and 600Mi memory.

Camunda has been installed using the camunda-platform Helm chart with only the resources, node selectors, and tolerations modified for Zeebe, Zeebe Gateway, and Elasticsearch. The back pressure configuration is the default from the Helm chart and/or Zeebe’s internal configuration.

Are there additional tunings (k8s, Azure, Java, and/or Camunda) we should consider
to maintain straight through performance and increase throughput as we increase the
number of partitions?

Thank you for your help.

lenaschoenburg · January 5, 2024, 8:06am

Thanks for the detailed report and sorry for getting back to you so late.

I thought of two things that might be relevant:

Did you adjust cpu and io thread counts? See Broker configuration | Camunda 8 Docs. We generally recommend 0.5-1 cpu thread per partition. Otherwise, all partitions on the same broker must compete for resources.
I suspect that most of the latency degradation is due to job activation which iterates over all partitions round-robin. Job streaming offers a significant improvement, especially for many partitions: Job workers | Camunda 8 Docs