Thanks @jwulf
Your advice does make sense, but it’s quite comprehensive and I believe it mostly describes performance testing in general rather than focusing on Zeebe’s nuances. It also suggests changing or abandoning parts of the business logic to isolate the performance test from as many side effects as possible. Unfortunately, we simply don’t have enough time (at least right now) to run exhaustive tests to investigate the impact of different parameters on the system and find the optimal configuration. What we’ve been asking for is a hint about where to look, so we could narrow down the scope of the issue.
Anyway, this investigation did produce some results that I’d like to share with you and the community, since some people may find them useful. After a closer look at the Grafana dashboard, we noticed a correlation between the “Commit Latency” metric and the time spent between workers, so our assumption was that communication with RocksDB is the weak link. According to the documentation, RocksDB writes updates not only to the in-memory memtable but also to a log file on the SSD, and this seems to be the culprit.
At this point we found that a very similar issue had already been discussed here: Single Broker commit latency is unreasonably high · Issue #8132 · camunda/zeebe · GitHub. The problem with that solution is that Falko uses a property that is never mentioned in the complete list of properties in the official docs (Configuration | Camunda Platform 8). We eventually found pull request #5576, which documents this experimental property (it’s been experimental for almost 2 years?).
Switching off this piece of functionality is probably not the best solution, as it may affect the resiliency of Zeebe processes; on the other hand, running a clustered Zeebe with replicas may trade any saved time for gateway-to-broker communication overhead and lead to another set of performance issues.
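For reference, here is roughly what we ended up setting. I’m assuming the flag discussed in that issue is the experimental disableExplicitRaftFlush option; the exact name may differ between Zeebe versions, so treat this as a sketch rather than a recipe:

```yaml
# Broker application.yaml (or the equivalent ZEEBE_BROKER_EXPERIMENTAL_* environment variable).
# Disabling the explicit flush means the broker no longer forces the log to disk on every commit,
# which trades durability guarantees for lower commit latency, so use with care.
zeebe:
  broker:
    experimental:
      disableExplicitRaftFlush: true
```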
Another option that did help us was “zeebe.client.job.pollinterval”. The default value of 100ms is a bit too high in our case: some workers need only 60-80ms, so waiting up to 100ms for the next poll seems wasteful. We’ll keep experimenting, but my guess is that somewhere between 10 and 25ms will be the sweet spot for us.
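For completeness, this is roughly how that setting looks in our Spring Zeebe application.yaml (the 20ms value is just what we’re experimenting with right now, not a recommendation, and I’m assuming the property is parsed as a regular Spring duration):

```yaml
# Spring Zeebe client configuration: poll for new jobs every 20ms instead of the default 100ms.
# A shorter interval lets workers pick up jobs sooner, at the cost of a bit more polling traffic.
zeebe:
  client:
    job:
      pollinterval: 20ms
```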
With all these changes we’ve managed to raise efficiency from 10-15% to ~80%. In other words, if the whole process takes ~250ms, workers consume ~200ms of it.
To sum up:
- A standalone broker with an embedded gateway may use the SSD too heavily and slow down processing. Possible solutions are switching to a cluster with replication, switching off explicit flushes, or mounting the file system used by RocksDB as tmpfs, which is essentially memory (see the sketch after this list)
- Decreasing the client poll interval helps workers start faster
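In case it helps anyone, here is a sketch of what the tmpfs idea could look like in a Kubernetes pod spec. The names and the mount path are assumptions for illustration, and keep in mind that everything in a memory-backed emptyDir is gone once the pod is rescheduled, so this only makes sense together with replication:

```yaml
# Kubernetes fragment: back the broker's data directory with tmpfs via an in-memory emptyDir.
spec:
  containers:
    - name: zeebe
      image: camunda/zeebe                   # whichever version you are running
      volumeMounts:
        - name: zeebe-data
          mountPath: /usr/local/zeebe/data   # default data dir of the official image
  volumes:
    - name: zeebe-data
      emptyDir:
        medium: Memory                       # tmpfs: the data lives in RAM, not on the SSD
```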
Sorry if my answer is a bit wordy; I’m just trying to be helpful. Please feel free to correct any mistakes above, as I’m sure my superficial analysis looks naive to someone who really understands what is happening under the hood.