Low Zeebe broker performance even under low processing load

Hello,

We are seeing low Zeebe broker performance even though the processing load is not high.
For the test we used two services (one node each): the first (shown in yellow below) has two workers, the second (shown in green below) has one worker.
Functionality was tested with 10 concurrent users, each sending 25 requests (the whole test takes up to 30 seconds in JMeter).
Unfortunately, we can't share details of the business flow, but in general it's a few conditions, each of which may trigger a few workers. Something like:
Start
If condition1 == true: do A, B, C
If condition2 == true: do X, Y
Finish

Here is the Zeebe broker configuration:

      io.camunda.zeebe.broker.system - Starting broker 0 with configuration {
  "network" : {
    "host" : "0.0.0.0",
    "portOffset" : 0,
    "maxMessageSize" : "4MB",
    "advertisedHost" : "0.0.0.0",
    "commandApi" : {
      "host" : "0.0.0.0",
      "port" : 26501,
      "advertisedHost" : "0.0.0.0",
      "advertisedPort" : 26501,
      "address" : "0.0.0.0:26501",
      "advertisedAddress" : "0.0.0.0:26501"
    },
    "internalApi" : {
      "host" : "0.0.0.0",
      "port" : 26502,
      "advertisedHost" : "0.0.0.0",
      "advertisedPort" : 26502,
      "address" : "0.0.0.0:26502",
      "advertisedAddress" : "0.0.0.0:26502"
    },
    "security" : {
      "enabled" : false,
      "certificateChainPath" : null,
      "privateKeyPath" : null
    },
    "maxMessageSizeInBytes" : 4194304
  },
  "cluster" : {
    "initialContactPoints" : [ "zeebe-0.zeebe.dev.svc.cluster.local:26502" ],
    "partitionIds" : [ 1 ],
    "nodeId" : 0,
    "partitionsCount" : 1,
    "replicationFactor" : 1,
    "clusterSize" : 1,
    "clusterName" : "zeebe-cluster",
    "heartbeatInterval" : "PT0.25S",
    "electionTimeout" : "PT2.5S",
    "membership" : {
      "broadcastUpdates" : false,
      "broadcastDisputes" : true,
      "notifySuspect" : false,
      "gossipInterval" : "PT0.25S",
      "gossipFanout" : 2,
      "probeInterval" : "PT1S",
      "probeTimeout" : "PT0.1S",
      "suspectProbes" : 3,
      "failureTimeout" : "PT10S",
      "syncInterval" : "PT10S"
    },
    "raft" : {
      "enablePriorityElection" : true
    },
    "messageCompression" : "NONE"
  },
  "threads" : {
    "cpuThreadCount" : 2,
    "ioThreadCount" : 2
  },
  "data" : {
    "directory" : "/usr/local/zeebe/data",
    "logSegmentSize" : "512MB",
    "snapshotPeriod" : "PT15M",
    "logIndexDensity" : 100,
    "diskUsageMonitoringEnabled" : true,
    "diskUsageReplicationWatermark" : 0.99,
    "diskUsageCommandWatermark" : 0.97,
    "diskUsageMonitoringInterval" : "PT1S",
    "freeDiskSpaceReplicationWatermark" : 41604219,
    "logSegmentSizeInBytes" : 536870912,
    "freeDiskSpaceCommandWatermark" : 124812657
  },
  "exporters" : {
    "hazelcast" : {
      "jarPath" : "/tmp/zeebe-hazelcast-exporter-1.2.1-jar-with-dependencies.jar",
      "className" : "io.zeebe.hazelcast.exporter.HazelcastExporter",
      "args" : null,
      "external" : true
    }
  },
  "gateway" : {
    "network" : {
      "host" : "0.0.0.0",
      "port" : 26500,
      "minKeepAliveInterval" : "PT30S"
    },
    "cluster" : {
      "contactPoint" : "0.0.0.0:26502",
      "requestTimeout" : "PT15S",
      "clusterName" : "zeebe-cluster",
      "memberId" : "gateway",
      "host" : "0.0.0.0",
      "port" : 26502,
      "membership" : {
        "broadcastUpdates" : false,
        "broadcastDisputes" : true,
        "notifySuspect" : false,
        "gossipInterval" : "PT0.25S",
        "gossipFanout" : 2,
        "probeInterval" : "PT1S",
        "probeTimeout" : "PT0.1S",
        "suspectProbes" : 3,
        "failureTimeout" : "PT10S",
        "syncInterval" : "PT10S"
      },
      "security" : {
        "enabled" : false,
        "certificateChainPath" : null,
        "privateKeyPath" : null
      },
      "messageCompression" : "NONE"
    },
    "threads" : {
      "managementThreads" : 1
    },
    "security" : {
      "enabled" : false,
      "certificateChainPath" : null,
      "privateKeyPath" : null
    },
    "longPolling" : {
      "enabled" : true
    },
    "interceptors" : [ ],
    "initialized" : true,
    "enable" : true
  },
  "backpressure" : {
    "enabled" : true,
    "algorithm" : "FIXED",
    "aimd" : {
      "requestTimeout" : "PT1S",
      "initialLimit" : 100,
      "minLimit" : 1,
      "maxLimit" : 1000,
      "backoffRatio" : 0.9
    },
    "fixed" : {
      "limit" : 20
    },
    "vegas" : {
      "alpha" : 3,
      "beta" : 6,
      "initialLimit" : 20
    },
    "gradient" : {
      "minLimit" : 10,
      "initialLimit" : 20,
      "rttTolerance" : 2.0
    },
    "gradient2" : {
      "minLimit" : 10,
      "initialLimit" : 20,
      "rttTolerance" : 2.0,
      "longWindow" : 600
    }
  },
  "experimental" : {
    "maxAppendsPerFollower" : 2,
    "maxAppendBatchSize" : "32KB",
    "disableExplicitRaftFlush" : false,
    "rocksdb" : {
      "columnFamilyOptions" : { },
      "enableStatistics" : false,
      "memoryLimit" : "512MB",
      "maxOpenFiles" : -1,
      "maxWriteBufferNumber" : 6,
      "minWriteBufferNumberToMerge" : 3,
      "ioRateBytesPerSecond" : 0,
      "disableWal" : false
    },
    "raft" : {
      "requestTimeout" : "PT5S",
      "maxQuorumResponseTimeout" : "PT0S",
      "minStepDownFailureCount" : 3,
      "preferSnapshotReplicationThreshold" : 100,
      "preallocateSegmentFiles" : true
    },
    "partitioning" : {
      "scheme" : "ROUND_ROBIN",
      "fixed" : [ ]
    },
    "queryApi" : {
      "enabled" : false
    },
    "consistencyChecks" : {
      "enablePreconditions" : false,
      "enableForeignKeyChecks" : false,
      "settings" : {
        "enablePreconditions" : false,
        "enableForeignKeyChecks" : false
      }
    },
    "features" : {
      "enableYieldingDueDateChecker" : false
    },
    "maxAppendBatchSizeInBytes" : 32768
  },
  "executionMetricsExporterEnabled" : false
}

The problem is that although each worker is relatively fast, the process as a whole takes quite a lot of time.

Picture 1 illustrates time distributions within the processed instance.

The orange bar at the top starts when we send a request to Zeebe to create a new process instance and ends with the response from Zeebe (we use synchronous calls).
The three bars below show the time consumed by the three workers used in this instance. Each worker takes 58 to 115 ms, or ~265 ms in total, which is only ~10-15% of the entire time.
This is just one example, but it's pretty typical.
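For context, the orange bar corresponds to a blocking create-instance call roughly like the sketch below (the process ID, variables, and gateway address are placeholders, not our real flow); the measured time is simply the wall-clock duration of this call:

import java.time.Duration;

import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.ProcessInstanceResult;

public class SyncStart {
  public static void main(String[] args) {
    // Connect to the embedded gateway of the standalone broker (no TLS in our setup)
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("zeebe-0.zeebe.dev.svc.cluster.local:26500")
        .usePlaintext()
        .build()) {

      long start = System.currentTimeMillis();

      // Blocks until the process instance has completed, then returns its result
      ProcessInstanceResult result = client.newCreateInstanceCommand()
          .bpmnProcessId("my-process")          // placeholder process id
          .latestVersion()
          .variables("{\"condition1\": true}")  // placeholder variables
          .withResult()
          .requestTimeout(Duration.ofSeconds(15))
          .send()
          .join();

      System.out.printf("E2E time: %d ms, result variables: %s%n",
          System.currentTimeMillis() - start, result.getVariables());
    }
  }
}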

To dig deeper we used the Simple Zeebe Monitor to see what's going on inside the process (the process is different from the screenshot above because HTTP tracing doesn't show the relation to the Zeebe process, but the timing is always pretty much the same).

To simplify understanding, please see the attached plot. It shows that each transition to a new state takes approximately the same amount of time, and although no single transition is critically slow (usually up to 100 ms), with ~100 steps the overall process becomes slow.

Picture 1: 1.png (Google Drive)
Picture 2: 2.png (Google Drive)

So the question is: how can we decrease the time spent on the steps needed for every state change? Is it possible to run them in parallel, or are there any configuration options to improve performance in this part?

Thanks in advance.

Hi @cyberpank

Have you tried profiling with different memory and CPU configurations for the brokers?

You are measuring the end-to-end processing time of each process, I take it.

I would measure the theoretical minimum E2E time of the process by running one worker for each task, starting a single process instance, and measuring its E2E time. The process is never going to be faster than this.

Then try ramping up concurrent requests, and see how the time changes as the numbers ramp up.

Then see what effect CPU and memory resources have.
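A rough sketch of such a ramp-up harness, assuming the standard Zeebe Java client (the process ID, gateway address, and the concurrency/request counts are placeholders): it measures the average blocking create-with-result time at increasing concurrency levels.

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import io.camunda.zeebe.client.ZeebeClient;

public class RampUp {
  public static void main(String[] args) throws Exception {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500")   // adjust to your gateway
        .usePlaintext()
        .build()) {

      for (int concurrency : new int[] {1, 2, 5, 10, 20}) {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        List<Future<Long>> timings = new ArrayList<>();

        // 25 requests per simulated user, mirroring the JMeter test
        for (int i = 0; i < concurrency * 25; i++) {
          timings.add(pool.submit(() -> {
            long start = System.nanoTime();
            client.newCreateInstanceCommand()
                .bpmnProcessId("my-process")   // placeholder process id
                .latestVersion()
                .withResult()                  // block until the instance completes
                .requestTimeout(Duration.ofSeconds(30))
                .send()
                .join();
            return (System.nanoTime() - start) / 1_000_000; // ms
          }));
        }

        long total = 0;
        for (Future<Long> t : timings) {
          total += t.get();
        }
        System.out.printf("concurrency=%d avg E2E=%d ms%n",
            concurrency, total / timings.size());
        pool.shutdown();
      }
    }
  }
}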

See also these posts:

Josh

Thanks @jwulf

Your advice does make sense, but it’s quite comprehensive and I believe that it mostly describes performance testing in general, not focusing on Zeebe nuances. It also suggests changing/abandoning parts of business logic to isolate perf testing from as many side effects as possible. Unfortunately, we simply don’t have enough time to run exhaustive testing (at least right now) to investigate the impact of different parameters on the system & find the optimal configuration. What we’ve been asking for is a kind of hint on where to look for a problem to narrow down the scope of the issue.

Anyway, this investigation did give some results that I'd like to share with you and the community, as some people may find them useful. After a closer look at the Grafana dashboard, we noticed a correlation between the "Commit Latency" metric and the time between workers, so our assumption was that communication with RocksDB is the weak link. According to the documentation, RocksDB writes updates not only to the in-memory memtable but also to a log file on the SSD, and this seems to be the culprit.

At this point we found that a very similar issue had already been discussed here: Single Broker commit latency is unreasonably high · Issue #8132 · camunda/zeebe · GitHub. The problem with that solution is that Falko uses a property that is never mentioned in the complete list of properties in the official docs (Configuration | Camunda Platform 8). We eventually found pull request #5576, which documents the experimental property (it has been experimental for almost 2 years?).

Switching off this piece of functionality is probably not the best solution, as it may affect the resiliency of Zeebe processes, but running a clustered Zeebe with replicas may trade any saved time for gateway-to-broker communication and lead to another set of performance issues.

Another option that did help us was "zeebe.client.job.pollinterval". The default value of 100 ms is a bit too high in our case because some workers only need 60-80 ms, so waiting 100 ms seems wasteful. We'll keep experimenting, but my guess is that somewhere between 10 and 25 ms will be the sweet spot for us.
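For anyone not using the spring-zeebe property: the same knob exists per worker in the plain Java client. A minimal sketch (the job type is a placeholder, and 20 ms is just the value we are currently experimenting with):

import java.time.Duration;

import io.camunda.zeebe.client.ZeebeClient;

// ...
ZeebeClient client = ZeebeClient.newClientBuilder()
    .gatewayAddress("zeebe-0.zeebe.dev.svc.cluster.local:26500")
    .usePlaintext()
    .build();

client.newWorker()
    .jobType("task-a")                    // placeholder job type
    .handler((jobClient, job) ->
        jobClient.newCompleteCommand(job.getKey()).send().join())
    .pollInterval(Duration.ofMillis(20))  // default is 100 ms
    .open();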

With all these changes we've managed to raise efficiency from 10-15% to ~80%. In other words, if the whole process takes ~250 ms, the workers consume ~200 ms of it.

To sum up:

  1. A standalone broker with an embedded gateway may use the SSD too intensively and slow down processing. A solution could be switching to a cluster with replication, switching off explicit flushes, or mounting the RocksDB directory as tmpfs, which is essentially memory.
  2. Decrease the client poll interval so that workers pick up jobs faster.

Sorry if my answer is a bit wordy, I'm just trying to be helpful. Please feel free to correct my mistakes above, as I'm sure my superficial analysis won't hold up for someone who really understands what is happening under the hood.

Hey @vasyl

thanks for your post and for sharing your findings!

I agree that it is mostly the disk that slows things down, but of course there are other parameters to consider as well. Maybe this post also helps: Questions about zeebe performance test - #2 by Zelldon

My normal advice would be: use a large enough SSD, use multiple partitions, turn on replication (as you have seen, we currently have issues with replication factor one), and give the brokers enough resources (~1-2 CPUs per partition). Make sure to also increase the CPU and I/O thread counts, otherwise the available CPUs aren't used. Set up a standalone gateway so you can scale the gateway independently (and give it its own resources); an embedded gateway will also slow down the broker. Reduce the worker count, since many workers can also slow down processing because all of them regularly send activation commands.
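As a rough illustration of the last point (the job type, address, and numbers below are placeholders, not recommended values): one worker per job type with a higher maxJobsActive and shared execution threads usually produces fewer activation requests than many small worker instances.

import java.time.Duration;

import io.camunda.zeebe.client.ZeebeClient;

// One client and one worker per job type, instead of many separate worker instances
ZeebeClient client = ZeebeClient.newClientBuilder()
    .gatewayAddress("zeebe-gateway:26500")  // placeholder address
    .usePlaintext()
    .numJobWorkerExecutionThreads(4)        // handler threads shared by all workers of this client
    .build();

client.newWorker()
    .jobType("task-a")                      // placeholder job type
    .handler((jobClient, job) ->
        jobClient.newCompleteCommand(job.getKey()).send().join())
    .maxJobsActive(32)                      // more jobs per activation request
    .pollInterval(Duration.ofMillis(50))
    .open();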

I recommend using our Helm charts to set up a Zeebe cluster: https://helm.camunda.io/. There you will also find quite good defaults, and you can configure everything more easily.

Hope that helps. :slight_smile:
