Dmitriy Zhezlov: Hi, can I calculate how much disk space a broker needs?
Nicolas: Hi Dmitriy, this depends on a few things:
-
What’s the configured
logSegmentSize
? Default is512MB
, but I would recommend a smaller size, between64MB
and128MB
. The smaller the segments, the more data can be deleted after compaction. However, very small segments means more I/O overhead, so you should find a good balance. -
What’s the configured
maxMessageSize
? This influences how big the payloads/variables can get. Again a smaller size means you will store less data, but it also means you get a performance boost. Having amaxMessageSize
which is aligned to your CPU arch and fits nicely in L1-L2 cache will get you much better performance -
How many concurrent in-flight workflow instances do you expect to handle? This affects the size of the log projection, i.e. the engine’s state “database”. The more workflows instances you have in flight simultaneously, the bigger the state. Zeebe uses RocksDB for as an hybrid in-memory/on-disk embedded database, which means it can spill out to disk if it gets too big, but it will use disk space. It also uses additional disk spaces for the snapshots.
With all of these, you can estimate the size a single partition will need: -
There is always at least one segment, but you should expect at least two since compaction is asynchronous. So
logSegmentSize * 2
. -
Your state size is roughly the size of your
maxMessageSize * numberOfInFlightWorkflows
. This is a very rough estimate, since actually a workflow instance can use more space thanmaxMessageSize
in the state, so you probably want to double this to be generous. -
There’s usually at least one snapshot, but during replication there’s a short window where there will be two: we first ensure the new one was properly written before deleting the old one, so you should provision for
3 * sizeOfState
(sizeOfState
being as calculated above).
That’s roughly per partition. There’s a bit of overhead to all partitions for metadata, so you can also add a few mega bytes on top. Then you have to multiply this by the number of partitions that your broker will have.
If you’re on 0.25+, I recommend tuning the disk watermarks configuration so that you can safely recover before going out of disk space. The default values are meant for local development, and should not be used for production.
Dmitriy Zhezlov: Thanks
Nicolas: If you run with exporters, you may end up with more than two segments since we only compact data once it’s been both processed and exported. So if you export to, say, Elasticsearch, but it’s currently down for whatever reason, then we’ll keep processing commands and so on, but the data will not be deleted until Elasticsearch has acknowledged it. This is were the watermarks come in handy, as they will stop processing once you reach the low watermark, but will keep exporting - when Elastic comes back up, it will export, then compact, then start processing again. In the mean time, your clients would receive RESOURCES_EXHAUSTED
(as you ran out of disk space, sort of)
Dmitriy Zhezlov: Should I add replication factor in this formula?
Nicolas: The number of partitions and the replication factor will tell you how many partitions there will be per broker. Sometimes it will be evenly distributed, but sometimes not. Usually I would configure all brokers the same, so I would pick whichever node will have the highest number of partitions, and use that as my worst case scenario. So if you have, say, 6 nodes, 8 partitions, and replication factor 3, you would get the following distribution:
P\N| N 0| N 1| N 2| N 3| N 4| N 5
P 0| L | F | F | - | - | -
P 1| - | L | F | F | - | -
P 2| - | - | L | F | F | -
P 3| - | - | - | L | F | F
P 4| F | - | - | - | L | F
P 5| F | F | - | - | - | L
P 6| L | F | F | - | - | -
P 7| - | L | F | F | - | -
Partitions per Node:
N 0: 4
N 1: 5
N 2: 5
N 3: 4
N 4: 3
N 5: 3
So when estimating disk size, I’d calculate the size of a partition and multiply by 5 as that’s the highest number of partitions a single broker will have.
Nicolas: (where N is node, and P is partition, L is leader, and F is follower - note the roles are guesses, they can’t be guaranteed, but the partition distribution is static)
Dmitriy Zhezlov: sizeOfState - is it logSegmentSize?
Nicolas: Not quite - the log is a series of events, which you can think of as a sequence of changes to be applied to your state. The state is an aggregation of this log. It’s similar to event sourcing (but not quite as strict). So every event is processed, and applied to the state. This implies the log is immutable (it’s append only) and the state is mutable, so they’re stored differently.
Nicolas: sizeOfState refers to the state “database”, i.e. RocksDB. This collects all the in-flight workflow instances and related data.
Nicolas: The logSegmentSize refers to the smallest unit of the log. So the size of your log is logSegmentSize * numberOfSegments
.
Nicolas: The size of your partition roughly is sizeOfState + sizeOfLog
Dmitriy Zhezlov: Is there a generator for distribution table ?
Josh Wulf: Not yet, but feel free to make one and become famous!
Note: This post was generated by Slack Archivist from a conversation in the Zeebe Slack, a source of valuable discussions on Zeebe (get an invite). Someone in the Slack thought this was worth sharing!
If this post answered a question for you, hit the Like button - we use that to assess which posts to put into docs.