I am running a Zeebe cluster with the following configuration:
- 3 brokers, each allocated 3GB of memory
- 6 partitions with a replication factor of 3
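For reference, this topology corresponds roughly to the following standard Zeebe broker environment variables (a sketch of my setup; the 3GB allocation itself is a container memory limit set in the orchestrator, not a Zeebe option):

```bash
# Cluster topology, expressed as Zeebe broker environment variables
export ZEEBE_BROKER_CLUSTER_CLUSTERSIZE=3
export ZEEBE_BROKER_CLUSTER_PARTITIONSCOUNT=6
export ZEEBE_BROKER_CLUSTER_REPLICATIONFACTOR=3
# The 3GB per broker is set via the container memory limit
# (e.g. Kubernetes resources.limits.memory: 3Gi), not here.
```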
During a recent load test, I observed the following behavior regarding memory utilization:
- Pre-test: process memory usage was approximately 1.25GB per broker.
- During the test: memory usage gradually climbed to around 2.84GB. Each workflow instance took around 8 minutes to complete.
- Post-test: after the test concluded, I left the system idle (no active workload) for 24 hours, but the process memory was never released.
- Restart: after several hours of idle operation, one broker unexpectedly restarted. Post-restart, its memory usage stabilized at around 1.4GB.
Unfortunately, because the restart recreated the Zeebe pod, I am unable to retrieve the logs from the previous pod instance to determine the exact cause of the restart.
My questions are:
1. Why didn't the broker's process memory decrease after the load test, even while the system was idle?
2. What could have triggered the broker restart after several hours of idling?
3. Given that the restart cost me access to the previous pod's logs, is there a recommended approach for capturing and persisting logs for future troubleshooting?
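(For context on question 3: the only built-in option I know of is pulling the previous container's logs, which works only when the container restarts in place within the same pod; once the pod itself is recreated, those logs are gone unless they were shipped elsewhere. The pod name below is a placeholder.)

```bash
# Fetch logs from the previous container instance of a broker pod.
# This does not survive pod recreation, which is why I lost the logs.
kubectl logs zeebe-broker-1 --previous
```

Ideally I'd like a setup that continuously ships broker logs off the pod (e.g. a log aggregator), so a restart can't take the evidence with it.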
I have updated the Zeebe broker configuration as suggested, setting `ZEEBE_BROKER_DATA_ROCKSDB_COLUMNFAMILYOPTIONS_WRITE_BUFFER_SIZE` and `ZEEBE_BROKER_DATA_ROCKSDB_COLUMNFAMILYOPTIONS_MAX_WRITE_BUFFER_SIZE_TO_MAINTAIN` to 8MB each, as shown below.
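Concretely, the overrides look like this (values written exactly as I set them; I am assuming the option parser accepts the `8MB` shorthand rather than requiring a raw byte count such as 8388608):

```bash
# RocksDB column family overrides applied to every broker
export ZEEBE_BROKER_DATA_ROCKSDB_COLUMNFAMILYOPTIONS_WRITE_BUFFER_SIZE=8MB
export ZEEBE_BROKER_DATA_ROCKSDB_COLUMNFAMILYOPTIONS_MAX_WRITE_BUFFER_SIZE_TO_MAINTAIN=8MB
```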
I then ran the same load test, which completed on the 13th of October, and afterwards left the system idle with no load. However, after two days, Zeebe broker1 restarted again. I have attached the process memory usage graph from Grafana for further analysis.