Under-utilized CPU on gateway and broker pods

We are load testing Zeebe 8.1.6, deployed on Kubernetes through the Helm charts. With 2 gateways and 6 brokers, we found that one gateway pod is overloaded while the CPU usage of the other gateway instance is below 3%. The same goes for the brokers: one broker is at 111% CPU usage while another is at only 40%. Attached is a screenshot from the Grafana dashboard.

Below is the relevant snippet from our Helm values file.

Could someone please review the configuration and let me know if something is incorrect?

Thanks.

zeebe:
  clusterSize: 6
  partitionCount: 12
  replicationFactor: 2
  cpuThreadCount: 4
  ioThreadCount: 4
  env:
    - name: ZEEBE_BROKER_EXECUTION_METRICS_EXPORTER_ENABLED
      value: "true"
  pvcSize: 40Gi
  resources:
    requests:
      cpu: 3
      memory: 4Gi
    limits:
      cpu: 3.5
      memory: 4.5Gi

zeebe-gateway:
  replicas: 2
  env:
    - name: ZEEBE_GATEWAY_THREADS_MANAGEMENTTHREADS
      value: "4"
    - name: ZEEBE_GATEWAY_MONITORING_ENABLED
      value: "true"
  resources:
    requests:
      cpu: 3
      memory: 4Gi
    limits:
      cpu: 3.5
      memory: 4.5Gi

Some thoughts:

You'll want to correlate this with the topology of the cluster. You'll probably find higher CPU usage on the partition leaders.
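To make that correlation concrete, here is a minimal sketch (not Zeebe code) of the round-robin scheme Zeebe uses to distribute partitions across brokers at startup, applied to the configuration from the question (6 brokers, 12 partitions, replication factor 2). Actual leadership can shift at runtime after restarts or failovers, so compare this against the live topology (e.g. via `zbctl status`).

```python
from collections import Counter

def distribute(cluster_size: int, partitions: int, replication: int):
    """Map each partition to its brokers; the first broker id is the
    preferred leader under round-robin distribution."""
    return {
        p: [(p - 1 + r) % cluster_size for r in range(replication)]
        for p in range(1, partitions + 1)
    }

# The configuration from the question: 6 brokers, 12 partitions, RF 2.
assignment = distribute(cluster_size=6, partitions=12, replication=2)
leaders = Counter(ids[0] for ids in assignment.values())
print(leaders)  # statically, every broker is preferred leader of 2 partitions
```

Since the static assignment is even, a persistent skew like 111% vs. 40% CPU usually means leadership has migrated at runtime, not that the configuration itself is unbalanced.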

Also, from memory: how you start process instances influences which partition they start on. Process instances that are started by messages are always started on the same partition.

Josh

Hey @jgeek1

Regarding the gateway: I opened an issue a few weeks ago, One gateway is always prefered · Issue #11310 · camunda/zeebe · GitHub, because I also ran into this. Please see the issue for more details.

Greets
Chris


Thanks for the replies @jwulf & @Zelldon

Right. I see two workarounds to this problem from the GitHub links:

  1. Add a load balancer, such as Kong, in front of the zeebe-gateways.
  2. Enable gRPC client-side load balancing on the zeebe-client that triggers the processes.

Will check on this further.
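For workaround 2, the usual mechanism is gRPC's channel service config: the default `pick_first` policy pins all traffic to the first resolved address (the behaviour described in issue #11310), while `round_robin` spreads calls across every address the DNS name resolves to, e.g. a Kubernetes headless service in front of the gateways. A hedged sketch (the target DNS name is a placeholder; adjust it to your service):

```python
import json

# Hypothetical headless-service DNS name for the gateways -- adjust to
# your namespace/service; the dns:/// scheme makes gRPC resolve all pods.
TARGET = "dns:///zeebe-gateway.zeebe.svc.cluster.local:26500"

# Service config selecting round_robin instead of the default pick_first.
service_config = json.dumps({"loadBalancingConfig": [{"round_robin": {}}]})

# With grpcio installed, the channel would be created like this:
# import grpc
# channel = grpc.insecure_channel(
#     TARGET, options=[("grpc.service_config", service_config)]
# )
print(service_config)
```

Whether your Zeebe client library exposes these channel options directly depends on the client; the gRPC-level mechanism is the same in each language.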

OK, what is the recommendation for the number of partitions, @jwulf? For example, should it be double the number of brokers? Is there a rule of thumb like this? Related to this is the replicationFactor.
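One purely arithmetic sanity check (not an official sizing rule): under round-robin distribution, partition leaders spread evenly across brokers only when partitionCount is a multiple of clusterSize. A quick sketch:

```python
# Count preferred leaders per broker under round-robin distribution.
def leader_spread(cluster_size: int, partitions: int) -> dict:
    counts = {b: 0 for b in range(cluster_size)}
    for p in range(partitions):
        counts[p % cluster_size] += 1
    return counts

print(leader_spread(6, 12))  # even: every broker leads 2 partitions
print(leader_spread(6, 8))   # uneven: two brokers lead 2, four lead 1
```

So 12 partitions on 6 brokers is even on paper; a non-multiple count would bake in a leader imbalance before any runtime effects.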

Thanks

We executed a long-running, three-hour stability test with 8 brokers, 8 partitions, and 2 gateways. We again observed under-utilized Zeebe brokers: CPU usage is around 41% and memory usage is around 45%. Below are the Grafana screenshots.

If we reduce to 7 brokers with an equal number of partitions, we don't achieve our desired throughput of 200 PI/s; we get slight backpressure with the 7-broker configuration.

Could you guide us, @Zelldon / @jwulf, on what could be tuned to improve resource usage on the brokers and thereby improve throughput?

Thanks.