Increasing Paritions Degrades Performance/Throughput

We have been benchmarking Camunda performance and scalability using a ten (10)
step workflow with no decision points or loops. Our understanding is that as we
increase the number of partitions to saturate node CPU/memory resources, the overall
throughput of the cluster should increase and the straight through processing time
for each workflow should remain relatively constant. However, our tests have yielded
straight-through performance degradation and relaitvely even throughput as we increase
the number of partitions.

Each step invokes a simple dummy service which echoes the POSTed a ~17k JSON payload
and passes it onto the next step. The test driver is submitting 10 requests/sec to run the workflow.
As we increase the number of partitions across ten(10) dedicated Zeebe nodes to saturate available
CPU and memory, Camunda’s average execution time per workflow significantly deteriorates as follows:

  • 10 partitions: 1.7 seconds (~5% CPU/~15% memory utilized)
  • 20 partitions: 2.1 seconds
  • 30 partitions: 2.9 seconds
  • 40 partitions: 6.3 seconds
  • 120 partitions: 21.3 seconds (~50% of CPU/memory utilized)

All tests were run with a replication factor of 3. We are also not seeing a significant increase in the
number of workflows per second completed as the number of partitions increases. ElasticSearch and
Zeebe node disk and network I/O are low with no indications of being I/O bound.

We have deployed Camunda 8.2.1 to AKS running k8s 1.2.7 with the following
node pools deployed to the EastUS region in the same availability zone:

  • 3x Common Nodes (running Operate, TaskList, and dummy services) (4 vCPU/16 GB RAM)
  • 10x Zeebe/Zeebe Gateway nodes (4 vCPU/16 GB RAM)
  • 3x Elasticsearch nodes (8 vCPUs/32 GB RAM)

All nodes are configured with Temporary kubelet storage, and the Zeebe and
Elasticsearch stateful sets are using the managed-csi-premium storage class.
Finally, the cluster is configured with the Standard SKU for increased master capacity/
reliability, and kubenet networking.

On this cluster, each Zeebe node is running one (1) Zeebe pod and one (1) Zeebe Gateway
pod (in addition to Azure daemonsets for metrics/log collection and CSI). The Zeebe
stateful set is configured with a resource limit and request of 2000m CPU and 5500Mi
memory and the Zeebe Gateway is configured with 400m CPU and 600Mi memory.

Camunda has been installed using the camunda-platform Helm chart with only the resources, node selectors, and tolerations modified for Zeebe, Zeebe Gateway, and Elasticsearch. The back pressure configuration is the default from the Helm chart and/or Zeebe’s internal configuration.

Are there additional tunings (k8s, Azure, Java, and/or Camunda) we should consider
to maintain straight through performance and increase throughput as we increase the
number of partitions?

Thank you for your help.