ClientStatusException: Expected to execute the command on one of the partitions, but all failed — Zeebe partitions at 94% disk usage

Hi,

We are encountering a critical issue in our Camunda 8 / Zeebe cluster and would appreciate any guidance.

Error

We are receiving the following error when trying to execute commands against Zeebe:

errorCode: io.camunda.zeebe.client.api.command.ClientStatusException: 
Expected to execute the command on one of the partitions, but all failed; 
there are no more partitions available to retry. Please try again. 
If the error persists contact your zeebe operator.

Environment Details

  • Zeebe Disk Usage: ~94% across partitions

  • Camunda Version: 8.8

  • Number of Brokers: 3

Observations

  • The error started occurring when disk usage on the Zeebe partitions reached approximately 94%.

  • All partitions appear to be rejecting commands simultaneously.

  • No single partition is available to accept and process commands.

What We’ve Tried

  1. Checked disk usage on all Zeebe broker nodes — confirmed usage is at ~94%.

  2. Reviewed Zeebe broker logs for additional errors or warnings.

  3. Attempted to restart brokers, but the issue persists.
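For step 1, a quick way to get the exact numbers the broker's disk monitor sees is to check free space on the data volume itself. A minimal sketch (the `DATA_DIR` path is an assumption; point it at your configured Zeebe data directory, e.g. `/usr/local/zeebe/data` in the official image):

```shell
# Sketch: inspect free space on the broker's data volume.
# DATA_DIR is an assumption; replace it with your broker's data directory.
DATA_DIR="${DATA_DIR:-/}"
df -h "$DATA_DIR"

# Free space in KB, for comparison against the broker's free-space threshold.
FREE_KB=$(df -k "$DATA_DIR" | awk 'NR==2 {print $4}')
echo "free on ${DATA_DIR}: ${FREE_KB} KB"
```

Running this on each broker node (or inside each broker pod) tells you how close each one is to the rejection threshold, since the limit is per broker, not cluster-wide.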

Questions

  1. Is the 94% disk usage the root cause? Does Zeebe enforce a disk usage threshold beyond which it stops accepting commands on partitions? If so, what is the default threshold?

  2. What is the recommended way to recover? Can we safely free up disk space (e.g., by triggering compaction, deleting old snapshots, or increasing disk size) while the cluster is in this state?

  3. How can we prevent this in the future? Are there best practices for configuring disk usage alerts or setting Zeebe’s diskUsageCommandWatermark and diskUsageReplicationWatermark thresholds?

  4. Is there a way to force Zeebe to resume processing once disk space is freed, or does it recover automatically?
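On question 1, note that a percentage like 94% only matters relative to the volume size, because Camunda 8.x expresses the threshold as absolute free space (the documented default for `freeSpace.processing` is 2GB; treat that value as an assumption for your setup). A small worked sketch of how 94% usage translates to free space on different volume sizes:

```shell
# Worked example (assumes the documented 2GB freeSpace.processing default):
# at 94% usage, whether commands are rejected depends on the volume size,
# because 6% of a small disk can fall below the absolute 2GB threshold.
for SIZE_GB in 20 30 50 100; do
  FREE_GB=$(awk -v s="$SIZE_GB" 'BEGIN { printf "%.1f", s * 0.06 }')
  VERDICT=$(awk -v f="$FREE_GB" 'BEGIN { v = (f < 2) ? "REJECTING commands" : "still accepting"; print v }')
  echo "${SIZE_GB}GB volume at 94% used -> ${FREE_GB}GB free: ${VERDICT}"
done
```

So on a 30GB volume, 94% usage leaves only ~1.8GB free, which would already be under a 2GB threshold, while the same percentage on a 100GB volume would still leave plenty of headroom.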

Any insights, similar experiences, or documentation pointers would be greatly appreciated. Thank you!

Your issue is caused by Zeebe’s disk usage watermarks: when free space on a broker’s data volume drops below the freeSpace.processing threshold (default 2GB), all partitions on that broker reject new commands to prevent disk exhaustion.
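For reference, a minimal broker configuration sketch showing where these thresholds live (assuming the Camunda 8.x `application.yaml` layout; the values shown are the documented defaults, and raising them triggers back-pressure earlier):

```yaml
# Zeebe broker application.yaml (sketch; keys per the Camunda 8.x broker config)
zeebe:
  broker:
    data:
      disk:
        enableMonitoring: true
        monitoringInterval: 1s
        freeSpace:
          processing: 2GB    # below this, partitions reject new commands
          replication: 1GB   # below this, replication is also paused
```

Once free space rises back above the threshold, the disk monitor should detect it on its next check and the broker resumes accepting commands without a restart.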

Does this help? If not, can anyone from the community jump in? :waving_hand:


:light_bulb: Hints: Use the Ask AI feature in Camunda’s documentation to chat with AI and get fast help. Report bugs and features in Camunda’s GitHub issue tracker. Trust the process. :robot: