Zeebe broker error: "Failed to write command MESSAGE_SUBSCRIPTION CREATE … error = FULL"

Hi everyone,

I am running a self-managed Camunda 8 (Zeebe) cluster in production.
Recently my broker started throwing the following error continuously:

Failed to write command MESSAGE_SUBSCRIPTION CREATE from 0 to logstream (error = FULL)

I cannot understand what exactly causes the “logstream FULL” state.
Here is my setup:

  • 8 partitions
  • All brokers are healthy at startup
  • The issue appears only under load when creating many message subscriptions
  • I want to understand why partition 0 becomes FULL
  • Disk is not 100% full, but the broker still reports this error

My questions:

  1. What does error = FULL exactly mean in Zeebe?
    Is it related to disk, log segment limits, or Raft replication backpressure?
  2. Under which conditions does the broker refuse to write a command to the logstream?
  3. Can this happen if all messages (MESSAGE_SUBSCRIPTION CREATE) are routed to the same partition because of the correlation key?
  4. How can I correctly diagnose:
     • snapshot issues
     • log segment rotation
     • replication lag
     • partition overload
  5. What configuration changes are recommended to avoid this FULL state?

Here are additional details if needed (I can provide full config):

  • zeebe version: 8.5.7
  • deployment: Docker
  • disk usage: 25G/2T
  • broker log segment size: 256MB
  • snapshot period: 5m

Hi @Mirik-0722,

This is a classic case of partition overload and backpressure in Zeebe. Let me break down what’s happening and how to address it:

What does error = FULL mean?

The error = FULL indicates that the logstream's internal writer buffer (the in-memory dispatcher that sits in front of the on-disk log) is full for that partition, so the broker cannot accept the command at that moment. This happens when:

  1. Backpressure mechanism: Zeebe uses backpressure to prevent being overwhelmed. When the broker receives more requests than it can process with acceptable latency, it rejects new requests to keep processing latency low.

  2. Buffer saturation: The logstream buffer fills up when records are written faster than they can be processed and exported, creating a backlog.
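The mechanism can be illustrated with a toy model (this is a pedagogical sketch of bounded-buffer backpressure, not Zeebe's actual implementation): a writer in front of a bounded buffer must reject new records whenever the consumer side lags behind.

```python
# Toy model of why a writer reports FULL: a bounded buffer where the
# consumer (processing/exporting) lags the producer (incoming commands).
from collections import deque

class BoundedLog:
    def __init__(self, capacity):
        self.buf = deque()
        self.capacity = capacity

    def try_write(self, record):
        if len(self.buf) >= self.capacity:
            return "FULL"          # writer rejects; caller must back off
        self.buf.append(record)
        return "OK"

    def consume(self, n):
        # simulate processing/exporting draining the buffer
        for _ in range(min(n, len(self.buf))):
            self.buf.popleft()

log = BoundedLog(capacity=3)
results = [log.try_write(i) for i in range(5)]
print(results)  # ['OK', 'OK', 'OK', 'FULL', 'FULL']
```

The fix is therefore never "make the buffer infinite" but either slowing the producer (client backoff) or speeding up the consumer (faster exporters, more partitions).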

Answering your specific questions:

1. What triggers the FULL state?

The FULL state is triggered by:

  • High write rate exceeding processing/exporting capacity
  • Slow exporters causing backlogs of unexported records
  • Partition overload due to uneven load distribution
  • Raft replication backpressure when followers can’t keep up

It’s not directly related to disk space (as you confirmed with 25G/2T usage).

2. Conditions for refusing commands:

Zeebe refuses to write commands when:

  • The number of in-flight requests exceeds configured limits
  • The logstream buffer is full due to processing/exporting bottlenecks
  • Write rate limiting is triggered based on exporting rate

3. Message subscription routing and partition 0:

Yes, this is very likely your issue!

Message subscriptions are routed to partitions based on the correlation key hash:

partition = (hash(correlation_key) % partition_count) + 1

(Partition ids in Zeebe start at 1.) If many of your correlation keys hash to the same partition, you have a “hot partition” problem. This is especially common if:

  • Correlation keys follow predictable patterns
  • You’re using sequential IDs or similar values
  • Many processes use the same or similar correlation keys
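You can estimate your key distribution offline with a quick script. The 31-based rolling hash over the key's bytes below is my approximation of what Zeebe's SubscriptionUtil does; verify against the source of your Zeebe version before drawing firm conclusions from the numbers.

```python
# Sketch: estimate how correlation keys spread over partitions.
# Assumption: Zeebe hashes the UTF-8 bytes of the correlation key with a
# Java-style 31-based rolling hash and maps it to a 1-based partition id.
from collections import Counter

PARTITION_COUNT = 8

def java_style_hash(key: str) -> int:
    h = 0
    for b in key.encode("utf-8"):
        signed = b if b < 128 else b - 256  # Java bytes are signed
        h = (31 * h + signed) & 0xFFFFFFFF
    # reinterpret as a signed 32-bit int, like Java
    return h - 0x100000000 if h >= 0x80000000 else h

def partition_for(key: str, partitions: int = PARTITION_COUNT) -> int:
    return abs(java_style_hash(key)) % partitions + 1  # partitions are 1-based

# Feed this your real correlation keys instead of the synthetic ones here.
keys = [f"order-{i}" for i in range(10_000)]
print(Counter(partition_for(k) for k in keys))
```

If the printed counts are heavily skewed toward one partition, that partition will bear almost all MESSAGE_SUBSCRIPTION traffic regardless of how many partitions you configure.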

4. Diagnostic approaches:

For snapshot issues:

  • Check snapshot frequency (your 5m setting is good)
  • Monitor if snapshots are completing successfully
  • Verify log segments are being deleted after snapshots

For log segment rotation:

  • Your 256MB segment size is reasonable
  • Ensure exporters are keeping up (slow exporters prevent segment deletion)
  • Monitor disk usage growth patterns

For replication lag:

  • Check if followers are keeping up with the leader
  • Monitor network latency between brokers
  • Verify sufficient disk I/O capacity

For partition overload:

  • Analyze correlation key distribution across partitions
  • Monitor per-partition metrics (if available)
  • Check if partition 0 consistently has higher load
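Zeebe brokers expose Prometheus metrics on the monitoring port (9600 by default in 8.x, under /actuator/prometheus), and many of them carry a partition label. A small sketch that aggregates such output by partition label to spot a hot partition; the metric name in the sample is illustrative only, so grep your broker's actual metrics output for real per-partition series:

```python
# Sketch: group Prometheus-format metric lines by their partition label.
# SAMPLE uses an illustrative metric name, not a real Zeebe metric.
import re
from collections import defaultdict

SAMPLE = """\
zeebe_example_records_total{partition="1"} 90210
zeebe_example_records_total{partition="2"} 1204
zeebe_example_records_total{partition="3"} 1377
"""

def per_partition(text):
    totals = defaultdict(float)
    for m in re.finditer(r'partition="(\d+)"\}\s+([0-9.eE+-]+)', text):
        totals[int(m.group(1))] += float(m.group(2))
    return dict(totals)

print(per_partition(SAMPLE))  # partition 1 dominates -> hot partition
```

In practice you would curl the metrics endpoint of each broker and run the parsed totals through this kind of grouping, or simply chart the per-partition series in Grafana.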

5. Configuration recommendations:

Immediate actions:

  1. Analyze your correlation keys - Check if they’re creating uneven partition distribution
  2. Scale up resources - More CPU/memory for the overloaded broker
  3. Monitor exporters - Ensure they’re not the bottleneck

Configuration tuning:

# Combined broker configuration (application.yaml)
zeebe:
  broker:
    cluster:
      # More partitions spread message subscriptions across more brokers.
      # Note: in 8.5 the partition count cannot be changed on an existing
      # cluster; increasing it means migrating to a new cluster.
      partitionsCount: 16
    # Backpressure is enabled by default; tune its limits rather than
    # disabling it
    backpressure:
      enabled: true
    # Keep disk watermarks explicit so the broker degrades predictably
    data:
      disk:
        freeSpace:
          processing: 2GB
          replication: 1GB

Client-side improvements:

  • Implement retry logic with exponential backoff for RESOURCE_EXHAUSTED errors
  • Consider diversifying correlation keys to improve partition distribution
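A minimal backoff sketch for the retry point above. BackpressureError is a hypothetical stand-in for whatever exception your Zeebe client raises on RESOURCE_EXHAUSTED; adapt it to your client library.

```python
# Sketch: retry with capped exponential backoff plus jitter when the
# broker signals backpressure (RESOURCE_EXHAUSTED).
import random
import time

class BackpressureError(Exception):
    """Hypothetical stand-in for the client's RESOURCE_EXHAUSTED error."""

def with_backoff(call, max_attempts=5, base_delay=0.1, max_delay=5.0):
    for attempt in range(max_attempts):
        try:
            return call()
        except BackpressureError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # add jitter

# Example: a call that succeeds on the third attempt.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise BackpressureError()
    return "correlated"

print(with_backoff(flaky, base_delay=0.01))  # prints "correlated"
```

The jitter matters: without it, many clients that were rejected together retry together and re-trigger the same FULL condition.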

Next steps:

  1. Immediate: Reduce load temporarily to stabilize the cluster
  2. Short-term: Analyze correlation key distribution and consider increasing partition count
  3. Long-term: Implement proper monitoring and alerting for partition-level metrics

Would you be able to share some examples of your correlation keys? This would help determine if partition distribution is indeed the root cause.
