Hi Camunda community. Our team uses Camunda 7.19.0 in a multi-pod deployment (8 pods on prod). Our system puts a heavy load on the DB, which is why we decided to switch to async history writes via Kafka. We have also configured TTLs for the history cleanup jobs:
generic-properties:
  properties:
    historyTimeToLive: "P2D"
    historyCleanupEnabled: true
    historyCleanupBatchSize: 500
    historyRemovalTimeStrategy: "start"
    batchOperationHistoryTimeToLive: "P2D"
    historyCleanupBatchWindowStartTime: "01:00"
    historyCleanupBatchWindowEndTime: "08:00"
    historyCleanupStrategy: removalTimeBased
    historyCleanupJobLogTimeToLive: "P2D"
One of the problems we faced: we started getting locks on the DB side during execution of the delete queries, especially on ACT_GE_BYTEARRAY, the biggest table in our database. We also noticed a lot of records with REMOVAL_TIME_ in the past that were never deleted. Did we miss something in the cleanup or job-execution configuration?
Hi @pzadorovskyi, here are several key points to check and tune:
1. Confirm Cleanup Jobs Are Executing Properly
Camunda's history cleanup runs as a job that the job executor picks up. Make sure:
- The job executor is enabled in each pod (check the job-execution settings).
- Only one node is acquiring the cleanup job—you might be seeing lock contention if multiple nodes attempt the same batch job.
Use this query to check for pending cleanup jobs:
SELECT * FROM ACT_RU_JOB WHERE HANDLER_TYPE_ = 'history-cleanup';
And for failed jobs:
SELECT * FROM ACT_RU_JOB WHERE EXCEPTION_MSG_ IS NOT NULL;
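If you prefer to check this from code, the engine exposes the cleanup jobs directly. A minimal sketch using the standard 7.x Java API (a historyService reference is assumed to be in scope):
// One cleanup job exists per configured degree of cleanup parallelism
List<Job> cleanupJobs = historyService.findHistoryCleanupJobs();
for (Job job : cleanupJobs) {
  System.out.printf("id=%s due=%s retries=%d suspended=%b%n",
      job.getId(), job.getDuedate(), job.getRetries(), job.isSuspended());
}
A retry count of 0 together with a non-null exception message typically means the cleanup job has given up and needs its retries reset before it will run again.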
2. Tune Cleanup Batch Sizes and Windows
Batch Size:
- You’re using historyCleanupBatchSize: 500, which may be too high, especially with large payloads.
- Try reducing to 100–200 and monitor locking behavior.
Batch Window:
- Keep the cleanup window in your lowest-traffic hours; the tuned YAML at the end of this post narrows it from 01:00–08:00 to 02:00–06:00.
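If you'd rather tune these programmatically than via YAML, here is a minimal sketch of a ProcessEnginePlugin (the class name CleanupTuningPlugin is made up; the setters are the standard ones on ProcessEngineConfigurationImpl):
import org.camunda.bpm.engine.ProcessEngine;
import org.camunda.bpm.engine.impl.cfg.ProcessEngineConfigurationImpl;
import org.camunda.bpm.engine.impl.cfg.ProcessEnginePlugin;

public class CleanupTuningPlugin implements ProcessEnginePlugin {

  @Override
  public void preInit(ProcessEngineConfigurationImpl config) {
    config.setHistoryCleanupBatchSize(100);                 // smaller slices, shorter delete transactions
    config.setHistoryCleanupBatchWindowStartTime("02:00");  // keep cleanup in the low-traffic window
    config.setHistoryCleanupBatchWindowEndTime("06:00");
  }

  @Override
  public void postInit(ProcessEngineConfigurationImpl config) { }

  @Override
  public void postProcessEngineBuild(ProcessEngine engine) { }
}
Register the plugin as a Spring bean (or in bpm-platform.xml) so it is applied before the engine starts.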
3. ACT_GE_BYTEARRAY Locking
This table stores serialized variables and payloads, and deleting from it can be slow, especially with foreign key constraints.
Tips:
- Ensure proper indexing on ACT_GE_BYTEARRAY and all foreign key columns.
- Check if cascading deletes are causing large transactions—manually deleting in smaller slices may help (see the sketch after this list).
- Consider partitioning the table (if supported by your DB).
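To illustrate "manually deleting in smaller slices", here is a hedged JDBC sketch. It assumes PostgreSQL and the standard Camunda 7 schema, the connection details are placeholders, and since it bypasses the engine entirely you should try it on a copy of the database first:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class SlicedByteArrayCleanup {
  public static void main(String[] args) throws Exception {
    try (Connection con = DriverManager.getConnection(
        "jdbc:postgresql://localhost:5432/camunda", "camunda", "secret")) { // placeholders
      con.setAutoCommit(true); // one short transaction per slice keeps locks short-lived
      String sql =
          "DELETE FROM ACT_GE_BYTEARRAY WHERE ID_ IN ("
        + " SELECT ID_ FROM ACT_GE_BYTEARRAY"
        + " WHERE REMOVAL_TIME_ <= CURRENT_TIMESTAMP LIMIT 500)";
      try (PreparedStatement ps = con.prepareStatement(sql)) {
        int deleted;
        do {
          deleted = ps.executeUpdate(); // deletes at most 500 expired rows per round
          System.out.println("deleted " + deleted + " rows");
        } while (deleted > 0);
      }
    }
  }
}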
4. Check REMOVAL_TIME_ Propagation
Objects are only deleted if REMOVAL_TIME_ is set on all related entities (e.g., process instance, variable instance, job log, etc.).
- If a bytearray record isn’t linked to a record with a valid REMOVAL_TIME_, it won’t be cleaned up.
Run:
SELECT COUNT(*) FROM ACT_HI_PROCINST WHERE REMOVAL_TIME_ IS NULL;
Also:
SELECT COUNT(*) FROM ACT_HI_VARINST WHERE REMOVAL_TIME_ IS NULL;
If many rows are missing REMOVAL_TIME_, you can set it retroactively with a batch via the HistoryService builder API:
historyService.setRemovalTimeToHistoricProcessInstances()
    .calculatedRemovalTime()
    .byQuery(historyService.createHistoricProcessInstanceQuery().finished())
    .executeAsync();
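executeAsync() returns a Batch handle, so you can also watch the recalculation from code (sketch; managementService assumed in scope):
Batch batch = historyService.setRemovalTimeToHistoricProcessInstances()
    .calculatedRemovalTime()
    .byQuery(historyService.createHistoricProcessInstanceQuery().finished())
    .executeAsync();
// The batch disappears from this query once it has completed
managementService.createBatchQuery().batchId(batch.getId()).singleResult();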
Or switch the strategy so removal times are calculated when instances end (note that this only affects instances finishing after the change; existing rows still need the batch above):
properties:
  historyRemovalTimeStrategy: end
  enableHistoricInstancePermissions: true
5. Check Kafka Async History Setup
When using async history:
- Deletion can only happen after async writes are completed.
- Ensure the Kafka history handler writes data promptly and doesn’t delay REMOVAL_TIME_ propagation.
- Check consumer lag and flush rates (a small lag-check sketch follows).
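For spot-checking consumer lag from code, a minimal sketch with the Kafka AdminClient (the bootstrap servers and the group id camunda-history-writer are placeholders for whatever your history pipeline actually uses):
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class HistoryConsumerLag {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder
    try (AdminClient admin = AdminClient.create(props)) {
      // Offsets the history consumer group has committed so far
      Map<TopicPartition, OffsetAndMetadata> committed =
          admin.listConsumerGroupOffsets("camunda-history-writer")
               .partitionsToOffsetAndMetadata().get();
      // Current log-end offsets of the same partitions
      Map<TopicPartition, OffsetSpec> request = new HashMap<>();
      committed.keySet().forEach(tp -> request.put(tp, OffsetSpec.latest()));
      Map<TopicPartition, ListOffsetsResultInfo> end = admin.listOffsets(request).all().get();
      // Lag per partition = log-end offset minus committed offset
      committed.forEach((tp, om) ->
          System.out.printf("%s lag=%d%n", tp, end.get(tp).offset() - om.offset()));
    }
  }
}
A sustained lag here means history events (and thus REMOVAL_TIME_ values) reach the DB late, so cleanup has nothing eligible to delete yet.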
6. Upgrade or Patch
If possible:
- Upgrade to Camunda 7.20.x or 7.21.x, where async history and cleanup logic have more fixes and optimizations.
- Review relevant Camunda JIRA tickets (like CAM-12362, CAM-12547, etc.) for history cleanup issues.
Suggested Next Steps
- Lower historyCleanupBatchSize and monitor DB locks.
- Verify that REMOVAL_TIME_ is properly set across all historic tables.
- Check that only one node runs cleanup jobs at a time.
- Inspect indexes and FK constraints on ACT_GE_BYTEARRAY.
- Monitor async history flush to DB via Kafka.
SQL Diagnostic Checklist for Camunda History Cleanup
These queries help diagnose why cleanup is not progressing and where locking or missing configuration might be occurring.
1. Pending Cleanup Jobs
SELECT * FROM ACT_RU_JOB WHERE HANDLER_TYPE_ = 'history-cleanup';
2. Failed Cleanup Jobs
SELECT * FROM ACT_RU_JOB WHERE HANDLER_TYPE_ = 'history-cleanup' AND EXCEPTION_MSG_ IS NOT NULL;
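For any job returned there, the full stacktrace is available via the Java API (snippet; managementService assumed in scope, jobId being the ID_ from the query above):
String stacktrace = managementService.getJobExceptionStacktrace(jobId);
System.out.println(stacktrace);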
3. Historic Process Instances Not Marked for Deletion
SELECT COUNT(*) FROM ACT_HI_PROCINST WHERE REMOVAL_TIME_ IS NULL;
4. Variables Without Removal Time
SELECT COUNT(*) FROM ACT_HI_VARINST WHERE REMOVAL_TIME_ IS NULL;
5. Bytearrays Not Referenced by Historic Variables
This approximates “orphaned” byte arrays. Note that ACT_GE_BYTEARRAY also backs other data (deployment resources, runtime variables, job-log stacktraces), so rows without a historic-variable reference are not necessarily garbage.
SELECT COUNT(*) FROM ACT_GE_BYTEARRAY b
LEFT JOIN ACT_HI_VARINST v ON b.ID_ = v.BYTEARRAY_ID_
WHERE v.BYTEARRAY_ID_ IS NULL;
6. Historic Bytearrays Still in DB
To monitor bloat:
SELECT COUNT(*) FROM ACT_GE_BYTEARRAY;
7. History Cleanup Job Lock Contention
See how long cleanup jobs are running or waiting:
SELECT ID_, JOB_DEF_ID_, DUEDATE_, LOCK_EXP_TIME_, RETRIES_, EXCLUSIVE_, SUSPENSION_STATE_
FROM ACT_RU_JOB
WHERE HANDLER_TYPE_ = 'history-cleanup';
Cleanup Tuning Script (Camunda 7.19 YAML)
Below is a tuned application.yaml block with more conservative cleanup settings. Everything is kept under generic-properties, which the Spring Boot starter passes straight through to the engine configuration:
camunda:
  bpm:
    generic-properties:
      properties:
        historyCleanupEnabled: true
        historyCleanupBatchSize: 100                # lower to reduce DB load per transaction
        historyCleanupBatchWindowStartTime: "02:00"
        historyCleanupBatchWindowEndTime: "06:00"
        historyCleanupStrategy: removalTimeBased
        historyRemovalTimeStrategy: "start"
        batchOperationHistoryTimeToLive: "P2D"
        historyTimeToLive: "P2D"
        historyCleanupJobLogTimeToLive: "P2D"
        enableHistoricInstancePermissions: true     # ensures correct permission propagation
Additional Tuning Tips
- Run Cleanup Job Manually for Testing:
historyService.cleanUpHistoryAsync(true); // true = make the job due immediately, ignoring the batch window
- Enable Logging for History Cleanup:
To debug history cleanup (the cleanup job logs under the job-executor category):
logging.level.org.camunda.bpm.engine.jobexecutor=DEBUG
- Tune Kafka Consumer Lag (If Applicable):
Ensure your async Kafka consumer for history is not lagging behind. Use metrics/logs from your Kafka monitoring (Prometheus or Kafka-UI), or the AdminClient sketch from point 5 above.