What to configure/adjust to reduce the process instance execution latency in Zeebe?

Rafael Pili: Hello! I'm implementing Zeebe workers with the Spring library in a k8s environment. I'm currently stress-testing the application with 50 VUs, and I configured the HPA. What I noticed is that even though new pods are being created to support the load, the broker keeps sending more requests to the initial pods (their CPU load stays much higher than all the others). Is this expected behavior? Should I have fewer worker replicas, but with more resources?

Rafael Pili: my worker configuration: long polling 5 min, maxJobsActive 20, poll interval 1 ms
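
For reference, a minimal sketch of how these worker settings could be expressed in an application.yaml for the spring-zeebe starter. The property names below are assumptions (they differ between spring-zeebe versions), so verify them against the version in use:

zeebe:
  client:
    request-timeout: PT5M        # long-polling timeout of 5 minutes (assumed property name)
    worker:
      max-jobs-active: 20        # at most 20 jobs activated per worker at once (assumed property name)
    job:
      poll-interval: PT0.001S    # 1 ms between job activation polls (assumed property name)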

Rafael Pili: also, I'm getting really high response times for a fairly small load. The workflow only has two steps, which together take 50ms to complete; however, I'm getting response times around the 900ms mark…

Rafael Pili: I can attest to the step execution time, since I'm logging it during execution

Rafael Pili: commit latency during testing

Rafael Pili: broker config

resources:
  requests:
    cpu: 5
    memory: 512Mi
  limits:
    cpu: 5
    memory: 512Mi

Rafael Pili: zeebe version 8.0.3

zell: Hey @Rafael Pili

could you please share a bit more information on the configuration you’re running?

• What type of disk? Size of disk?
• Which cloud provider? GKE?
• How many partitions?
• How many nodes?

In general, yes, it is wise to have fewer workers, since too many can overload the system as well (due to the polling mechanism); of course it depends on the worker config.

Rafael Pili: hi! I'm using a 32GB SSD for each node, and the cloud provider is AWS. On this test I was using 3 partitions; I increased to 48 and got much better results, with the response time now ranging up to 500ms. I'm using 3 nodes.

Rafael Pili: what is boggling me the most is that I'm getting very high latencies for a small workload. With 2 VUs I'm getting an average of 350ms in response time; however, the workflow only has 2 steps which take more or less 50ms combined, so Zeebe is adding almost 200ms of latency for only 2 VUs. Why is that?

Rafael Pili: I would be really glad if you could help me @zell :smile:

zell: Sorry @Rafael Pili for the late response.

• What is meant by 50VUs?
• How many process instances are you running per second?
• How many should be completed per second?
• How many jobs? How many different job types?
Unfortunately, I have no experience with AWS, but on GKE (where we run our benchmarks) the disk size and CPU count are used to determine the disk throughput (see "Configure disks to meet performance requirements" in the Compute Engine documentation). It might make sense to increase the disk size in your case as well, or to check whether Zeebe is IO throttled.

Did you think about increasing your Zeebe cluster to 5 or 7 nodes, to better distribute the partitions?

I guess you use the helm charts?
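
For illustration, a minimal sketch of the corresponding knobs in the zeebe section of the camunda-platform Helm chart values (the key names match the configuration shared later in this thread; the values are examples, not a recommendation):

zeebe:
  clusterSize: "5"          # number of broker pods, e.g. 5 or 7 to spread partitions further
  partitionCount: "48"
  replicationFactor: "1"
  pvcSize: "64Gi"           # on some cloud disk types a larger volume also gets more throughput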

Rafael Pili: thanks, I will try that. My process only has 2 quick jobs, with no replication. I tried using the disable-flush experimental flag and got much better response times. However, there is a lot of instability: the response times range between 70ms and 200ms.
50 VUs would be like threads in JMeter, but I'm using k6. That is now getting me a throughput of 7k rpm
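
For context, the flag mentioned above is set as a broker environment variable (it also appears in the final configuration at the end of this thread). A minimal sketch, with the usual caveat that skipping the explicit Raft flush trades durability guarantees for lower latency:

  env:
    - name: ZEEBE_BROKER_EXPERIMENTAL_DISABLEEXPLICITRAFTFLUSH
      value: "true"   # do not explicitly flush on every Raft append; lower latency, weaker durability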

Rafael Pili: I will try now with 64Gi and 5 nodes, let’s see

zell: > with no replication
Do you mean zeebe without replication? Or no additional workers?

Interesting, so you're referring to https://k6.io/docs/using-k6/scenarios/executors/per-vu-iterations/, right? What does your test application look like?

Rafael Pili: Hi, zell, exactly! I mean I was using the following config in my last tests: 3 nodes, 48 partitions and replication factor 1

zell: I think this is related to https://github.com/camunda/zeebe/issues/8132 where we see some issues with the commit latency with no replication

Rafael Pili: yes! it’s probably related, since disabling flush got me absurdly better results

Rafael Pili: I wanted to thank you @zell, I finally got 10.6k throughput with 128ms of response time following your tips. It was really important to get this response time, thanks!

zell: Awesome! That's great to hear @Rafael Pili, thanks for the feedback :slight_smile:

zell: Do you mind sharing your final configuration, for posterity? :slight_smile:

Rafael Pili: sure! just a sec

Rafael Pili:

resources:
  requests:
    cpu: 800m
    memory: 6Gi
  limits:
    cpu: 1500m
    memory: 6Gi

...

  persistenceType: disk
  # PvcSize defines the persistent volume claim size, which is used by each broker pod (https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims)
  pvcSize: "64Gi"

...

  cpuThreadCount: "2"
  # IoThreadCount defines how many threads can be used for the exporting on each broker pod
  ioThreadCount: "1"

...

  # ClusterSize defines the amount of brokers (=replicas), which are deployed via helm
  clusterSize: "5"
  # PartitionCount defines how many zeebe partitions are set up in the cluster
  partitionCount: "48"
  # ReplicationFactor defines how each partition is replicated, the value defines the number of nodes
  replicationFactor: "1"

...

  env:
    - name: ZEEBE_BROKER_DATA_DISKUSAGECOMMANDWATERMARK
      value: "0.8"
    - name: ZEEBE_BROKER_DATA_DISKUSAGEREPLICATIONWATERMARK
      value: "0.9"
    - name: ZEEBE_BROKER_EXPERIMENTAL_CONSISTENCYCHECKS_ENABLEPRECONDITIONS
      value: "true"
    - name: ZEEBE_BROKER_EXPERIMENTAL_CONSISTENCYCHECKS_ENABLEFOREIGNKEYCHECKS
      value: "true"
    - name: ZEEBE_BROKER_EXPERIMENTAL_MAXAPPENDSPERFOLLOWER
      value: "5"
    - name: ZEEBE_BROKER_EXPERIMENTAL_DISABLEEXPLICITRAFTFLUSH
      value: "true"

zell: BTW, just to clarify: are these 10k jobs per minute? If that is the case, it would mean 5k process instances (PI) per minute, right (two jobs per instance)? I guess we can fine-tune your config a bit more if you like.

By the looks of it you were able to scale the Zeebe brokers with HPA.
Does Zeebe support HPA, and how do we set up HPA with a Zeebe cluster? Is it supported?
Please can you shed some more light on this?
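
For reference, the HPA described at the top of the thread was configured for the worker pods, not the brokers (the brokers are deployed as a StatefulSet by the Helm chart). A minimal sketch of such an HPA, assuming a hypothetical worker Deployment named zeebe-worker:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: zeebe-worker            # hypothetical name, matching the assumed worker Deployment
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: zeebe-worker          # the Spring Zeebe worker Deployment (assumed name)
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU utilization exceeds 70%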