Zeebe Broker Memory Retention Post-Load Test and Unexpected Restart

Hello,

I am running a Zeebe cluster with the following configuration:

  • 3 brokers, each allocated 3GB of memory
  • 6 partitions with a replication factor of 3

During a recent load test, I observed the following behavior regarding memory utilization:

  • Pre-test: Broker process memory usage was approximately 1.25GB per broker.
  • During the test: Memory usage gradually increased to around 2.84GB. Each workflow instance during the test took around 8 minutes to complete.
  • Post-test: After the test concluded, I left the system idle (with no active workload) for 24 hours. However, the process memory was never reclaimed and did not decrease.
  • Restart: After several hours of idle operation, one broker unexpectedly restarted. Post-restart, the memory usage stabilized at around 1.4GB.

Unfortunately, due to the restart of the Zeebe pod, I am unable to retrieve the logs from the previous pod instance to determine the exact cause of the restart.

My questions are:

  1. Why didn’t the broker’s process memory decrease after the load test, even when the system was idle?
  2. What could have triggered the broker restart after being idle for several hours?
  3. Given that I lost access to the previous pod logs due to the restart, is there a recommended approach to capture and persist logs for future troubleshooting?

Did you apply any tuning before running the load test?

You mentioned that you allocated 3GB per broker and that usage reached about 2.85GB during the load test and did not come down. It may have kept growing until it hit the memory limit, at which point Kubernetes would have force-restarted the pod (an OOM kill).
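
If Kubernetes did OOM-kill the container, the pod status should still record it. A quick check you can run (the pod name zeebe-1 and the namespace are placeholders for your own):

  # Show why the previous container instance terminated (look for OOMKilled)
  kubectl describe pod zeebe-1 -n <namespace> | grep -A 5 'Last State'

  # Or query the termination reason directly
  kubectl get pod zeebe-1 -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'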

You could forward the logs to a centralised logging or remote system using a log forwarder such as Fluentd, so that the logs are stored outside the pod and survive restarts.
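
Also note that right after a container restart (as opposed to the pod being rescheduled to another node), the previous instance's logs are often still retrievable. A minimal sketch, with pod name and namespace as placeholders:

  # Logs of the current container instance
  kubectl logs zeebe-1 -n <namespace>

  # Logs of the previous (crashed/restarted) container instance
  kubectl logs zeebe-1 -n <namespace> --previous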

Would you please share the following details if possible:

  1. What is your Camunda version?
  2. What is your Kubernetes version?
  3. Are you using a native Kubernetes cluster? If not, what is your k8s platform (AKS, GCP, etc.)?
  4. What tuning settings did you apply before the load test?
  5. Are you monitoring the k8s platform with any tools?

Here are the details you asked for:

  • Camunda Version: 8.4.9
  • Kubernetes Version: 1.30
  • Kubernetes Platform: AWS EKS
  • Tuning Settings: No tuning settings applied; we are using the default settings from here.
  • Monitoring Tools: Yes, we are monitoring the Kubernetes platform with Grafana and CloudWatch.

To clarify, after the load test, the memory usage fluctuated between 2.76GB and 2.85GB per broker.

Thanks for sharing the details.

Zeebe uses RocksDB to maintain its state, and the restart was most likely caused by an OOM driven by RocksDB memory usage.

By default, Zeebe uses:

  • ZEEBE_BROKER_DATA_ROCKSDB_COLUMNFAMILYOPTIONS_WRITE_BUFFER_SIZE (default = 64MB)
  • ZEEBE_BROKER_DATA_ROCKSDB_COLUMNFAMILYOPTIONS_MAX_WRITE_BUFFER_SIZE_TO_MAINTAIN (default = 128MB)

Tune the above values and run the load test again to see the behavior.

Add the values to the Zeebe configuration:

  • name: ZEEBE_BROKER_DATA_ROCKSDB_COLUMNFAMILYOPTIONS_WRITE_BUFFER_SIZE
    value: 8MB
  • name: ZEEBE_BROKER_DATA_ROCKSDB_COLUMNFAMILYOPTIONS_MAX_WRITE_BUFFER_SIZE_TO_MAINTAIN
    value: 8MB
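
After rolling out the change, it is worth verifying that the variables actually reached the broker container before re-running the test. A quick sanity check (pod name and namespace are placeholders):

  # Confirm the RocksDB tuning variables are set inside the container
  kubectl exec zeebe-1 -n <namespace> -- env | grep ROCKSDB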

Please test with these values and let us know the results.

I have updated the Zeebe broker configuration as suggested, setting ZEEBE_BROKER_DATA_ROCKSDB_COLUMNFAMILYOPTIONS_WRITE_BUFFER_SIZE to 8MB and ZEEBE_BROKER_DATA_ROCKSDB_COLUMNFAMILYOPTIONS_MAX_WRITE_BUFFER_SIZE_TO_MAINTAIN to 8MB.

I ran the same load test, which completed on the 13th of October. After the run, I left the system idle with no load. However, after two days, Zeebe broker1 restarted again. I have attached the process memory usage from Grafana for further analysis.

Did you capture the logs from the broker this time?

Which OS are your AWS Kubernetes nodes running, Ubuntu or Amazon Linux? Did you check the journalctl logs on the host system?
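
If the kernel OOM killer terminated the broker process, it leaves a trace in the node's kernel log. A minimal check, assuming you can get a shell on the node:

  # Search kernel messages for OOM killer activity
  journalctl -k | grep -i -E 'out of memory|oom-killer|killed process'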