Zeebe is unreasonably slow with an out-of-the-box setup

We've been trying to use Zeebe for some time, and although some people online confirm that it works with millions or even billions of operations/users, it's not coping with ~5 concurrent users in our case. That's very weird and we can't figure out why.

Earlier we used a standalone version of Zeebe (for dev needs) and faced the same problem. At that point in time we did some investigation, and it seemed like RocksDB was flushing data to the SSD on every process step to guarantee data persistence; although each individual flush wasn't critically slow, the sum of all of them was a performance killer.

To work around this we temporarily disabled it with the experimental disableExplicitRaftFlush=true setting, but we can't use that as a final solution in production for various reasons, and we've been advised to use a clustered version of Zeebe. As far as I understand, the reasoning was: in a cluster we have replication, so a single node failure doesn't lead to data loss, and thus there is no need to flush data to the SSD so frequently.

The complete conversation is available here:

Recently we've tried a cluster version of Zeebe with 5 nodes and 5 partitions, replication, a standalone gateway, and slightly increased resources. Basically, we've followed Zelldon's advice from our previous discussion. We've also disabled the hacky disableExplicitRaftFlush and now use the defaults. As a result, the system is a bit more stable, but it has also become slower.
Zeebe is deployed in Kubernetes with the default Helm chart.

Unfortunately, we're not allowed to share implementation details, but just to give a sense of how bad things are, I can share some timings. For example, if we sum up the execution time of all 4 workers we get ~1-1.5 seconds, but the total execution time is ~3-4 seconds (75th percentile) for 5 concurrent users. If we run the same test with 100 concurrent users, the total execution time can easily rise to 30-50 seconds at best and 100+ seconds at worst. The system is also likely to drop many requests due to overload.

Switching disableExplicitRaftFlush back to true makes things better, so we may assume that at least part of the problem is somewhere in the exporters.

We have a gut feeling that we're missing something very basic, because why else would a system that can handle a huge load fail to handle a minimal load on our side? Do you have any ideas about what could be wrong here?

Thanks in advance


Hey @vasyl

sorry for the late reply, and sorry that you still have performance issues.

Before I can give you more input or hints, I need to understand your setup, use case, and workload a bit better.

Could you please answer the following questions:

Which version of Zeebe do you use?
What type of client do you use? Which language?

not coping with ~5 concurrent users in our case.

What exactly do you mean by users here? Users accessing Tasklist? 5 different clients? Could you describe your setup and the tools you use around Zeebe in a bit more detail, so I can understand it better?

You mentioned you run the Helm charts. Can you share the values file you're using, or did you not configure anything beyond the defaults?

Recently we've tried a cluster version of Zeebe with 5 nodes and 5 partitions, replication, a standalone gateway, and slightly increased resources.

Please share more details about that. How does your configuration look right now?
Which cloud provider do you use? GKE? AWS?

Unfortunately, we're not allowed to share implementation details, but just to give a sense of how bad things are, I can share some timings.

Here we need to know how long the job worker execution time is and what your process model looks like. If you're not allowed to share it, maybe you can tell us how many tasks you have in your model (and whether they run concurrently, etc.). Do you use multi-instance or call activities?
How many process instances are you running concurrently? In general, the load on the system would be quite interesting.

What kind of payload do you have for your process instances? How many variables and how big are they?

Are you using any exporters?

Do you have metrics we can take a look at? I would advise running Prometheus and Grafana with your Zeebe cluster deployment so you have better insight. Here you can find a Grafana dashboard you can use: zeebe/monitor/grafana at main · camunda/zeebe · GitHub

What type of disks are you using? How big are they? Can you see how your disks are performing? Any metrics from your cloud provider, maybe?


Once these questions are answered, I hope we can get a better insight into what is happening and help you here.

Greets
Chris

Hello @Zelldon, I'll try to answer your questions; @vasyl will probably add more later regarding the workers and clients.

Zeebe version 8.0.5, Java 17 client for workers.

Cluster deployed to k8s on bare metal (VMs in the local data center)
Currently we have 5 Zeebe brokers and 1 standalone gateway in the cluster.

We use the Hazelcast exporter to an external HZ cluster, plus simple-monitor. Simple-monitor doesn't work, but that's another problem anyway.

The disks are SSDs and should be reasonably fast.
Regarding metrics: we have metrics collected by Prometheus and displayed in Grafana. Actually, there are dozens of them and I cannot see any signs of slow processing. We can share any dashboards if it helps.

Broker configuration

```
{
  "network" : {
    "host" : "0.0.0.0",
    "portOffset" : 0,
    "maxMessageSize" : "4MB",
    "advertisedHost" : "zb-zeebe-0.zb-zeebe.dev.svc.cluster.local",
    "commandApi" : {
      "host" : "0.0.0.0",
      "port" : 26501,
      "advertisedHost" : "zb-zeebe-0.zb-zeebe.dev.svc.cluster.local",
      "advertisedPort" : 26501,
      "address" : "0.0.0.0:26501",
      "advertisedAddress" : "zb-zeebe-0.zb-zeebe.dev.svc.cluster.local:26501"
    },
    "internalApi" : {
      "host" : "0.0.0.0",
      "port" : 26502,
      "advertisedHost" : "zb-zeebe-0.zb-zeebe.dev.svc.cluster.local",
      "advertisedPort" : 26502,
      "address" : "0.0.0.0:26502",
      "advertisedAddress" : "zb-zeebe-0.zb-zeebe.dev.svc.cluster.local:26502"
    },
    "security" : {
      "enabled" : false,
      "certificateChainPath" : null,
      "privateKeyPath" : null
    },
    "maxMessageSizeInBytes" : 4194304
  },
  "cluster" : {
    "initialContactPoints" : [ "zb-zeebe-0.zb-zeebe.dev.svc.cluster.local:26502", "zb-zeebe-1.zb-zeebe.dev.svc.cluster.local:26502", "zb-zeebe-2.zb-zeebe.dev.svc.cluster.local:26502", "zb-zeebe-3.zb-zeebe.dev.svc.cluster.local:26502", "zb-zeebe-4.zb-zeebe.dev.svc.cluster.local:26502" ],
    "partitionIds" : [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ],
    "nodeId" : 0,
    "partitionsCount" : 8,
    "replicationFactor" : 2,
    "clusterSize" : 5,
    "clusterName" : "zb-zeebe",
    "heartbeatInterval" : "PT0.25S",
    "electionTimeout" : "PT2.5S",
    "membership" : {
      "broadcastUpdates" : false,
      "broadcastDisputes" : true,
      "notifySuspect" : false,
      "gossipInterval" : "PT0.25S",
      "gossipFanout" : 2,
      "probeInterval" : "PT1S",
      "probeTimeout" : "PT0.1S",
      "suspectProbes" : 3,
      "failureTimeout" : "PT10S",
      "syncInterval" : "PT10S"
    },
    "raft" : {
      "enablePriorityElection" : true
    },
    "messageCompression" : "NONE"
  },
  "threads" : {
    "cpuThreadCount" : 8,
    "ioThreadCount" : 8
  },
  "data" : {
    "directory" : "/usr/local/zeebe/data",
    "logSegmentSize" : "128MB",
    "snapshotPeriod" : "PT5M",
    "logIndexDensity" : 100,
    "diskUsageMonitoringEnabled" : true,
    "diskUsageReplicationWatermark" : 0.99,
    "diskUsageCommandWatermark" : 0.97,
    "diskUsageMonitoringInterval" : "PT1S",
    "logSegmentSizeInBytes" : 134217728,
    "freeDiskSpaceCommandWatermark" : 61399204,
    "freeDiskSpaceReplicationWatermark" : 20466401
  },
  "exporters" : {
    "hazelcast" : {
      "jarPath" : "/usr/local/zeebe/exporters/zeebe-hazelcast-exporter-jar-with-dependencies.jar",
      "className" : "io.zeebe.hazelcast.exporter.HazelcastExporter",
      "args" : {
        "clusterName" : "dev",
        "name" : "zeebe-dev",
        "remoteAddress" : "hz-hazelcast.test:5701",
        "remoteConnectionTimeout" : "PT30S"
      },
      "external" : true
    }
  },
  "gateway" : {
    "network" : {
      "host" : "0.0.0.0",
      "port" : 26500,
      "minKeepAliveInterval" : "PT30S"
    },
    "cluster" : {
      "contactPoint" : "0.0.0.0:26502",
      "requestTimeout" : "PT15S",
      "clusterName" : "zeebe-cluster",
      "memberId" : "gateway",
      "host" : "0.0.0.0",
      "port" : 26502,
      "membership" : {
        "broadcastUpdates" : false,
        "broadcastDisputes" : true,
        "notifySuspect" : false,
        "gossipInterval" : "PT0.25S",
        "gossipFanout" : 2,
        "probeInterval" : "PT1S",
        "probeTimeout" : "PT0.1S",
        "suspectProbes" : 3,
        "failureTimeout" : "PT10S",
        "syncInterval" : "PT10S"
      },
      "security" : {
        "enabled" : false,
        "certificateChainPath" : null,
        "privateKeyPath" : null
      },
      "messageCompression" : "NONE"
    },
    "threads" : {
      "managementThreads" : 1
    },
    "security" : {
      "enabled" : false,
      "certificateChainPath" : null,
      "privateKeyPath" : null
    },
    "longPolling" : {
      "enabled" : true
    },
    "interceptors" : [ ],
    "initialized" : true,
    "enable" : false
  },
  "backpressure" : {
    "enabled" : true,
    "algorithm" : "VEGAS",
    "aimd" : {
      "requestTimeout" : "PT1S",
      "initialLimit" : 100,
      "minLimit" : 1,
      "maxLimit" : 1000,
      "backoffRatio" : 0.9
    },
    "fixed" : {
      "limit" : 20
    },
    "vegas" : {
      "alpha" : 3,
      "beta" : 6,
      "initialLimit" : 20
    },
    "gradient" : {
      "minLimit" : 10,
      "initialLimit" : 20,
      "rttTolerance" : 2.0
    },
    "gradient2" : {
      "minLimit" : 10,
      "initialLimit" : 20,
      "rttTolerance" : 2.0,
      "longWindow" : 600
    }
  },
  "experimental" : {
    "maxAppendsPerFollower" : 2,
    "maxAppendBatchSize" : "32KB",
    "disableExplicitRaftFlush" : false,
    "rocksdb" : {
      "columnFamilyOptions" : { },
      "enableStatistics" : false,
      "memoryLimit" : "512MB",
      "maxOpenFiles" : -1,
      "maxWriteBufferNumber" : 6,
      "minWriteBufferNumberToMerge" : 3,
      "ioRateBytesPerSecond" : 0,
      "disableWal" : true
    },
    "raft" : {
      "requestTimeout" : "PT5S",
      "maxQuorumResponseTimeout" : "PT0S",
      "minStepDownFailureCount" : 3,
      "preferSnapshotReplicationThreshold" : 100,
      "preallocateSegmentFiles" : true
    },
    "partitioning" : {
      "scheme" : "ROUND_ROBIN",
      "fixed" : [ ]
    },
    "queryApi" : {
      "enabled" : false
    },
    "consistencyChecks" : {
      "enablePreconditions" : false,
      "enableForeignKeyChecks" : false,
      "settings" : {
        "enablePreconditions" : false,
        "enableForeignKeyChecks" : false
      }
    },
    "features" : {
      "enableYieldingDueDateChecker" : false
    },
    "maxAppendBatchSizeInBytes" : 32768
  },
  "executionMetricsExporterEnabled" : false
}
```

Hi @Zelldon

Thanks for the reply (and @cyberpank for providing some answers)
I’ll try to cover the remaining questions

What type of client do you use? Which language?

As previously mentioned, we're using Java 17 (OpenJDK 17.0.4).
It's a Spring Boot application, so we're using the starter:

```
<dependency>
    <groupId>io.camunda</groupId>
    <artifactId>spring-zeebe-starter</artifactId>
    <version>1.3.0</version>
</dependency>
```

What exactly do you mean by users here? Users accessing Tasklist? 5 different clients? Could you describe your setup and the tools you use around Zeebe in a bit more detail, so I can understand it better?

To test performance we're using JMeter 5.5, so by 5 users I meant 5 threads.
We have a 30-second warm-up period followed by 12-15 minutes of actual testing.
Each test is a series of calls. First, we create an entity in a database, make some changes that don't require Zeebe, and check that everything is OK and ready for the heavy part. Next we "release" this entity, which triggers a Zeebe process with a few workers, check that it has passed, and launch the same process again but with different values (similar to PUT in the REST world).

I don't think I'm allowed to share the real BPMN file, but I think I can provide an anonymized version of it (please see the attached schema.bpmn and schema.png files).
schema.bpmn (28.6 KB)

In our case, we trigger the first subprocess with two workers, A and B, and the final subprocess with worker N. Workers A and N run in service #1 and worker B belongs to service #2.

In general, the load on the system would be quite interesting.

Do you have metrics we can take a look at?

Regarding the load, we're not trying to break the system (at least not yet), just to make it faster, so currently it's about 0.8-1 request per second.
Since we receive roughly one request per second and processing takes 3-5 seconds, we can estimate that we run 3-5 process instances concurrently.

We've launched the tests many times, so I'll share the most typical picture. Please see the image in the next post (as a new user I'm not allowed to attach more than one media file, so I have to split the message).

For the combined timing, you can see requests per second on the left and overall execution time on the right (in seconds). Here 0.25, 0.5, and so on are percentile values.
How it was measured: a metric timer is started just before sending a task to Zeebe and stopped in the response callback (the whenComplete() method in the Java code).
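In code, the measurement is roughly the following sketch (simplified; the process id, payload, and Micrometer timer are made-up placeholders, but the structure matches what we do):

```
import java.util.Map;
import java.util.concurrent.TimeUnit;

import io.camunda.zeebe.client.ZeebeClient;
import io.micrometer.core.instrument.Timer;

public class StartAndMeasure {

    private final ZeebeClient client;
    private final Timer timer; // overall execution time metric

    StartAndMeasure(ZeebeClient client, Timer timer) {
        this.client = client;
        this.timer = timer;
    }

    void startInstance(Map<String, Object> payload) {
        final long start = System.nanoTime(); // started just before sending
        client.newCreateInstanceCommand()
                .bpmnProcessId("my-process") // made-up id
                .latestVersion()
                .variables(payload)
                .withResult() // wait until the process instance completes
                .send()
                .whenComplete((result, error) ->
                        // stopped in the response callback
                        timer.record(System.nanoTime() - start, TimeUnit.NANOSECONDS));
    }
}
```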

For the workers, you can see the timing (please ignore the 0.99 and 0.999 outliers; those seem to be DB timeouts, not related to Zeebe)
How it was measured: a timer is started as early as possible in the @ZeebeWorker method and stopped in the finally block (basically on return to the broker).
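The worker-side measurement is essentially this sketch (again simplified; the job type and metric name are made up):

```
import java.util.concurrent.TimeUnit;

import org.springframework.stereotype.Component;

import io.camunda.zeebe.client.api.response.ActivatedJob;
import io.camunda.zeebe.client.api.worker.JobClient;
import io.camunda.zeebe.spring.client.annotation.ZeebeWorker;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

@Component
public class WorkerA {

    private final Timer timer;

    WorkerA(MeterRegistry registry) {
        this.timer = registry.timer("worker.a.execution"); // made-up metric name
    }

    @ZeebeWorker(type = "worker-a") // made-up job type
    public void handle(final JobClient client, final ActivatedJob job) {
        final long start = System.nanoTime(); // started as early as possible
        try {
            // ... business logic ...
            client.newCompleteCommand(job.getKey()).send().join();
        } finally {
            // stopped in the finally block, i.e. on return to the broker
            timer.record(System.nanoTime() - start, TimeUnit.NANOSECONDS);
        }
    }
}
```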

What kind of payload do you have for your process instances? How many variables and how big are they?

The payload is JSON that holds control variables for business logic and values that should be passed from one worker to another (and sometimes saved in DBs).

The initial JSON contains about 30 fields at different nesting levels and takes ~1.5 KB. In later phases it may grow. I don't have exact numbers, but a rough estimate is about 50-70 fields (or about 200 lines of formatted JSON) taking 7-10 KB.

Hope this answers your questions, but please let us know if something was missed or if you need more details.

Thank you!

Hey,

thank you both for your answers and all the information you provided! I have some follow-up questions 🙂

Your broker config shows:

    "partitionIds" : [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ],
    "nodeId" : 0,
    "partitionsCount" : 8,

I'm wondering how this can happen. Did you change the configuration in between?
Regarding the replication factor of 2: what was the reason for picking this value? I suggest increasing it to at least three. See also other posts regarding this:

Can you also share the configuration of the standalone gateway?
I would also be interested in the Helm values file, if you have one.
Are you planning to migrate to the most recent version in the near future? We have already released 8.1.x (Zeebe 8.1.4, and 8.1.6 for spring-zeebe).

Cluster deployed to k8s on bare metal (VMs in the local data center)

Do you have insights into the disk performance (IOPS)?

Regarding metrics: we have metrics collected by Prometheus and displayed in Grafana. Actually, there are dozens of them and I cannot see any signs of slow processing. We can share any dashboards if it helps.

This would be really interesting. Are you using the Zeebe dashboards? If so, could you share the panels under the latency section? Under "Processing" I would like to see the "Processing Queue Size" panel. A screenshot of the General view would be nice as well.

To test performance we're using JMeter 5.5, so by 5 users I meant 5 threads.

Ok, this is interesting and something we have to take a deeper look at, I think.

Next we "release" this entity, which triggers a Zeebe process with a few workers, check that it has passed, and launch the same process again but with different values (similar to PUT in the REST world).

As far as I understand, you create a process instance and verify in this thread whether the process instance has completed. Where are the workers running? In a different thread or on a different machine? Sorry if I overlooked that.

but I think I can provide an anonymized version of it

Thanks, that is great! Sharing the model helps a lot.

According to the model, this means you have in the worst case 14 tasks to complete, plus 6 sub-processes, conditional gateways, etc., which also take processing time, of course. But as far as I understand from your comment, you only have three tasks to complete in your test scenario, is that right?

Since we receive roughly one request per second and processing takes 3-5 seconds, we can estimate that we run 3-5 process instances concurrently.

Thanks that helps.

Can you share in more detail what you're doing in the sequence flow conditions? Any big computations with FEEL expressions? In your example I can see something like some op in field_4 satisfies op = "my_condition_4"; I'm not sure whether this is what you're actually doing right now.

Since you are in the benchmarking stage right now, would it be possible to adjust the test/process to verify where the issue lies? For example, by removing the conditions or simplifying them. It would be interesting to me to understand whether this might be an issue with the variables.

a metric timer is started just before sending a task to Zeebe and stopped in the response callback (the whenComplete() method in the Java code)

Just for clarification: you send the CreateProcessInstance command with await result to the broker, i.e. you start the measurement before sending and then wait for the result, correct?

For the workers, you can see the timing (please ignore the 0.99 and 0.999 outliers; those seem to be DB timeouts, not related to Zeebe)
How it was measured: a timer is started as early as possible in the @ZeebeWorker method and stopped in the finally block (basically on return to the broker).

Ok, this is measuring your client code, correct? So your workers take 0.5 seconds? Or what unit is this?

The initial JSON contains about 30 fields at different nesting levels and takes ~1.5 KB. In later phases it may grow. I don't have exact numbers, but a rough estimate is about 50-70 fields (or about 200 lines of formatted JSON) taking 7-10 KB

Thanks. Please be aware that this might affect processing and activation time, and potentially also the evaluation time of expressions. Did you try filtering the variables fetched by your workers? This might help with job activation; it's useful when a worker doesn't want or doesn't need to see all variables.
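With spring-zeebe the filtering can be declared directly on the annotation, something like this sketch (job type and variable names are made up):

```
import io.camunda.zeebe.client.api.response.ActivatedJob;
import io.camunda.zeebe.client.api.worker.JobClient;
import io.camunda.zeebe.spring.client.annotation.ZeebeWorker;

// The worker then receives only the listed variables instead of the full payload.
@ZeebeWorker(type = "worker-n", fetchVariables = {"field_1", "field_4"})
public void handleWorkerN(final JobClient client, final ActivatedJob job) {
    final var variables = job.getVariablesAsMap(); // contains only field_1 and field_4
    // ... business logic ...
    client.newCompleteCommand(job.getKey()).send().join();
}
```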

Look forward to your reply.

Greets
Chris

Hi again, I'll try to answer the questions that are closer to my specialization. @cyberpank will answer the others a bit later.

We have a few microservices. The first one doesn't have workers; it just creates tasks for Zeebe, measures execution time, logs errors, etc. A few other services have workers. In this specific test case we have two microservices with workers: the first one runs worker A and worker N, and the second runs worker B. There is one more service that registers a few workers that are never used in this flow.

Yes, that’s correct

We have a field "field_4" in the Zeebe variables. It's an array of strings, and if it contains the value "my_condition_4" the condition is true and we need the workers in the corresponding subprocess. I can't share the details of our case, but I can sketch a similar/fake implementation. Imagine that "field_4" holds notification types and the values are ["email", "push", "message"]. In this case the "email" subprocess will generate an email body and send it, "push" will send a push notification, and so on. A flow that is activated by "phone call" won't run because there is no such value in the request.
I'm sorry if my explanation is lame.

Well, yes, we can adjust some things, but only to some extent, to keep the code in a working state. If you're talking about the conditions, I'm not sure how to simplify them any further, because right now all of them are a simple "contains", which should be a cheap operation, and only one of them is true.

Yes, we start the timer just before we create an instance and stop it in the whenComplete() method. I'm not quite sure how it's implemented under the hood, but it works better than the blocking join() we were using earlier.

The first metric measures overall execution time, which is worker execution time + Zeebe processing. On the other (worker) graphs we measure only our business logic. That's why we're concerned in the first place: total execution time is 3-5 seconds, but the workers take only ~1-1.5 seconds.

We've been trying to pass as little data to each worker as possible, but over time some shortcuts have been taken, and now it's not always clean. In this particular example, the last worker doesn't need all the info from workers A and B, but I'm not sure we can remove variables from the flow once they have been sent to workers. It also seems odd to me, because this doesn't look like a large enough amount of data to slow Zeebe down by a few seconds, even at moderate load.

Thank you!

Hey @vasyl

thanks for your response. I hope your colleague can answer the other questions above, like the replication factor and why you have 10 partitions instead of 8.

  • Could you share your worker configurations? What are the poll interval, the timeouts, maxJobsActive, etc.?
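For reference, on the plain Java client builder these settings look like the following (values are purely illustrative; with spring-zeebe they are attributes of @ZeebeWorker or configuration properties):

```
import java.time.Duration;

import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.worker.JobWorker;

void openWorker(ZeebeClient client) {
    final JobWorker worker = client.newWorker()
            .jobType("worker-a") // made-up job type
            .handler((jobClient, job) -> { /* business logic */ })
            .maxJobsActive(32) // how many jobs may be activated at once
            .pollInterval(Duration.ofMillis(100)) // how often the worker polls for jobs
            .timeout(Duration.ofSeconds(30)) // how long an activated job stays locked
            .requestTimeout(Duration.ofSeconds(10)) // long-polling request timeout
            .open();
}
```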

In order to understand what the limiting factor is and what influences the performance, I would propose changing some parts of your benchmark and verifying whether that makes any difference. I would suggest you do it one by one, to see what has an effect. The following are some suggestions to reduce the blast radius and to try to pinpoint the issue.

  1. Turn off the additional (not needed) workers. Reasoning: workers in general influence the processing performance, since every once in a while they send activation commands (which need to be processed).
  2. Reduce the variables in the workers via fetchVariables (see the sketch in my previous reply). Reasoning: variables need to be read from the state and sent to each worker; limiting this helps to reduce processing time and network load.
  3. It would be interesting whether it makes a difference for your benchmark if you introduce a root variable under which you store the complete payload. Like: you have now {a: 2, b: 3, c: 4} and change it to {root: {a: 2, b: 3, c: 4}} (see the sketch after this list). Reasoning: root variables need to be resolved one by one; if you have only one, it might be faster.
  4. Replace the conditions with just =true, which should remove the FEEL expression evaluation overhead.
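For the third suggestion, the change on the client side would be roughly this sketch (process id and field names are made up):

```
import java.util.Map;

import io.camunda.zeebe.client.ZeebeClient;

void startWithRootVariable(ZeebeClient client) {
    // instead of passing many top-level variables {a: 2, b: 3, c: 4} ...
    Map<String, Object> flat = Map.of("a", 2, "b", 3, "c", 4);
    // ... store the complete payload under a single root variable:
    Map<String, Object> wrapped = Map.of("root", flat);

    client.newCreateInstanceCommand()
            .bpmnProcessId("my-process") // made-up id
            .latestVersion()
            .variables(wrapped) // only one top-level variable to resolve
            .withResult()
            .send()
            .join();
}
```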

Hope that helps. Let me know if you have any questions.

Greets
Chris