How do I calculate the disk space needed for a broker?

Dmitriy Zhezlov: Hi, can I calculate how much disk space a broker needs?

Nicolas: Hi Dmitriy, this depends on a few things:

  1. What’s the configured logSegmentSize? The default is 512MB, but I would recommend a smaller size, between 64MB and 128MB. The smaller the segments, the more data can be deleted after compaction. However, very small segments mean more I/O overhead, so you should find a good balance.

  2. What’s the configured maxMessageSize? This influences how big the payloads/variables can get. Again, a smaller size means you will store less data, and it also gives you a performance boost. Having a maxMessageSize which is aligned to your CPU architecture and fits nicely in the L1-L2 cache will get you much better performance :slightly_smiling_face:

  3. How many concurrent in-flight workflow instances do you expect to handle? This affects the size of the log projection, i.e. the engine’s state “database”. The more workflow instances you have in flight simultaneously, the bigger the state. Zeebe uses RocksDB as a hybrid in-memory/on-disk embedded database, which means the state can spill to disk if it gets too big, but it will use disk space. It also uses additional disk space for the snapshots.
    With all of these, you can estimate the size a single partition will need:

  4. There is always at least one segment, but you should expect at least two since compaction is asynchronous. So logSegmentSize * 2.

  5. Your state size is roughly maxMessageSize * numberOfInFlightWorkflows. This is a very rough estimate, since a workflow instance can actually use more space than maxMessageSize in the state, so you probably want to double this to be generous.

  6. There’s usually at least one snapshot, but during replication there’s a short window where there will be two: we first ensure the new one was properly written before deleting the old one. So you should provision for 3 * sizeOfState, i.e. the runtime state plus up to two snapshots (sizeOfState being as calculated above).
    That’s roughly per partition. There’s a bit of metadata overhead for every partition, so you can also add a few megabytes on top. Then you have to multiply this by the number of partitions that your broker will have (see the sketch just below).
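Putting it together, here is a rough back-of-the-envelope sketch of that estimate in Python. The function and variable names are purely illustrative (they are not broker settings), and the metadata overhead is an assumed placeholder:

```python
MB = 1024 ** 2

def estimate_partition_size(log_segment_size, max_message_size,
                            in_flight_instances, generous=True,
                            metadata_overhead=16 * MB):
    """Rough disk usage of a single partition, in bytes."""
    log = 2 * log_segment_size                      # at least two segments, compaction is async
    state = max_message_size * in_flight_instances  # rough size of the RocksDB state
    if generous:
        state *= 2                                  # an instance can use more than maxMessageSize
    state_and_snapshots = 3 * state                 # runtime state plus up to two snapshots
    return log + state_and_snapshots + metadata_overhead

def estimate_broker_size(partition_size, max_partitions_per_broker):
    """Worst case: size for the node that hosts the most partitions."""
    return partition_size * max_partitions_per_broker

# e.g. 128MB segments, 4MB messages, 1000 in-flight instances, 3 partitions on this broker
per_partition = estimate_partition_size(128 * MB, 4 * MB, 1000, generous=False)
print(estimate_broker_size(per_partition, 3) / MB)  # 36816.0, i.e. roughly 36GB
```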

If you’re on 0.25+, I recommend tuning the disk watermarks configuration so that you can safely recover before running out of disk space. The default values are meant for local development and should not be used in production.

Dmitriy Zhezlov: Thanks :+1:

Nicolas: If you run with exporters, you may end up with more than two segments, since we only compact data once it has been both processed and exported. So if you export to, say, Elasticsearch, but it’s currently down for whatever reason, then we’ll keep processing commands and so on, but the data will not be deleted until Elasticsearch has acknowledged it. This is where the watermarks come in handy: they will stop processing once you reach the low watermark, but will keep exporting. When Elasticsearch comes back up, it will export, then compact, then start processing again. In the meantime, your clients would receive RESOURCES_EXHAUSTED (as you ran out of disk space, sort of)

Dmitriy Zhezlov: Should I add replication factor in this formula?

Nicolas: The number of partitions and the replication factor will tell you how many partitions there will be per broker. Sometimes it will be evenly distributed, but sometimes not. Usually I would configure all brokers the same, so I would pick whichever node will have the highest number of partitions, and use that as my worst case scenario. So if you have, say, 6 nodes, 8 partitions, and replication factor 3, you would get the following distribution:

P\N | N 0 | N 1 | N 2 | N 3 | N 4 | N 5
P 0 |  L  |  F  |  F  |  -  |  -  |  -
P 1 |  -  |  L  |  F  |  F  |  -  |  -
P 2 |  -  |  -  |  L  |  F  |  F  |  -
P 3 |  -  |  -  |  -  |  L  |  F  |  F
P 4 |  F  |  -  |  -  |  -  |  L  |  F
P 5 |  F  |  F  |  -  |  -  |  -  |  L
P 6 |  L  |  F  |  F  |  -  |  -  |  -
P 7 |  -  |  L  |  F  |  F  |  -  |  -
Partitions per Node:
N 0: 4
N 1: 5
N 2: 5
N 3: 4
N 4: 3
N 5: 3

So when estimating disk size, I’d calculate the size of a partition and multiply by 5 as that’s the highest number of partitions a single broker will have.

Nicolas: (where N is node, and P is partition, L is leader, and F is follower - note the roles are guesses, they can’t be guaranteed, but the partition distribution is static)
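If you want to reproduce that distribution, here is a minimal sketch assuming a simple round-robin placement (partition p starts on node p mod N, with its replicas on the next replicationFactor - 1 nodes); the real assignment logic may differ, so treat this as an approximation:

```python
from collections import Counter

def distribute_partitions(node_count, partition_count, replication_factor):
    """Round-robin sketch: partition p is placed on nodes p, p+1, ... (mod node_count)."""
    assignment = {
        p: [(p + r) % node_count for r in range(replication_factor)]
        for p in range(partition_count)
    }
    partitions_per_node = Counter(n for replicas in assignment.values() for n in replicas)
    return assignment, partitions_per_node

# 6 nodes, 8 partitions, replication factor 3 -> same counts as the table above
assignment, per_node = distribute_partitions(6, 8, 3)
print(dict(per_node))          # {0: 4, 1: 5, 2: 5, 3: 4, 4: 3, 5: 3}
print(max(per_node.values()))  # 5, the worst-case number of partitions on one broker
```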

Dmitriy Zhezlov: sizeOfState - is it logSegmentSize?

Nicolas: Not quite - the log is a series of events, which you can think of as a sequence of changes to be applied to your state. The state is an aggregation of this log. It’s similar to event sourcing (but not quite as strict). So every event is processed, and applied to the state. This implies the log is immutable (it’s append only) and the state is mutable, so they’re stored differently.

Nicolas: sizeOfState refers to the state “database”, i.e. RocksDB. This collects all the in-flight workflow instances and related data.

Nicolas: The logSegmentSize refers to the smallest unit of the log. So the size of your log is logSegmentSize * numberOfSegments.

Nicolas: The size of your partition is roughly sizeOfState + sizeOfLog

Dmitriy Zhezlov: Is there a generator for the distribution table?

Josh Wulf: Not yet, but feel free to make one and become famous!

Note: This post was generated by Slack Archivist from a conversation in the Zeebe Slack, a source of valuable discussions on Zeebe (get an invite). Someone in the Slack thought this was worth sharing!

zell: I created one once, see here: https://github.com/zeebe-io/zeebe/tree/develop/benchmarks/docs/scripts#partitiondistributionsh @Josh Wulf

zell: Really nice answers @Nicolas, we should put that in the docs or FAQ :smiley:

Dmitriy Zhezlov: Thx!

Dmitriy Zhezlov: Do I understand it right?
I want to calculate disk space for 1 broker and 1 partition.
I have:

  • logSegmentSize = 128MB (per Nicolas’ recommendation)
  • maxMessageSize = 4MB (by default, for example)
  • numberOfInFlightWorkflows = 1000

After calculating I will have:

  • logSegmentSize * 2 = 256MB
  • sizeOfState * 3 = (maxMessageSize * numberOfInFlightWorkflows) * 3 = 12000MB

Now, for 1 broker and 1 partition I should have logSegmentSize * 2 + sizeOfState * 3 = 12256MB (~12GB). Is that correct?

Nicolas: Roughly, yes. But that would be if you really use 4MB per workflow instance, which I hope is not the case :sweat_smile:

Dmitriy Zhezlov: It’s the default, but is it small or big?

Nicolas: The biggest space used by instances will be your variables, so I highly recommend sending just references/small payloads and storing larger data somewhere else. So for example, pass around a path to your S3 bucket or redis queue and store the larger data there.

Nicolas: It’s quite big, and we will probably reduce it in the future.

Nicolas: I would recommend much, much smaller, something like 64KB or 128KB - however like I mentioned, this will also limit how many records can be batched, and therefore will limit how big the collections your multi-instance elements can handle =/

Nicolas: This is something we’ll change in the future, but that’s how it currently is, unfortunately

Dmitriy Zhezlov: I understand, we will test this parameter for our processes

Nicolas: So if you feel confident, you could leave it at 4MB and still pass around small payloads - you can then estimate roughly how much your workflows will use based on the variables you pass (pick an upper bound), and then use that instead of maxMessageSize to calculate how much space you need.
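For example, plugging a per-instance payload upper bound into the earlier estimate instead of maxMessageSize (the 32KB figure here is just a made-up illustration, not a recommendation):

```python
MB = 1024 ** 2
KB = 1024

# Hypothetical numbers: maxMessageSize stays at 4MB, but each instance only
# carries around 32KB of variables, so use that as the upper bound for the state.
payload_upper_bound = 32 * KB
in_flight_instances = 1000

state = payload_upper_bound * in_flight_instances  # ~31 MB instead of ~4000 MB
state_and_snapshots = 3 * state                    # runtime state plus up to two snapshots
print(state_and_snapshots / MB)                    # 93.75
```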

Dmitriy Zhezlov: If I have 1000 numberOfInFlightWorkflows and I want to use 10 partitions, can I say that

every partition will have 100 numberOfInFlightWorkflows (numberOfInFlightWorkflows / countPartitions)?

Nicolas: Another option is to start with a lot of disk space, then run your load for a while, and check how big the state/snapshots are. You can find out by running du -sh /usr/local/zeebe/data/raft-partitions/1/runtime (this is the state, where 1 is partition 1 - you would replace it with 2 for partition 2, etc.) and du -sh /usr/local/zeebe/data/raft-partitions/1/snapshots (where snapshots are stored - this includes both pending and committed snapshots).

Nicolas: Yes more or less.

Nicolas: It’s not scientifically proven or anything, but we do try to round-robin these :sweat_smile:

Dmitriy Zhezlov: Thanks @Nicolas for these formulas, and @zell for the table-generator script!
