Locks on db on cleanup job execution (multi pod deployment)

Hi Camunda community. Our team uses Camunda 7.19.0 in a multi-pod deployment (8 pods on prod). Our system puts a heavy load on the DB, which is why we decided to switch to async history writes using Kafka. We have also configured TTLs for the history cleanup jobs:

generic-properties:
  properties:
    historyTimeToLive: "P2D"
    historyCleanupEnabled: true
    historyCleanupBatchSize: 500
    historyRemovalTimeStrategy: "start"
    batchOperationHistoryTimeToLive: "P2D"
    historyCleanupBatchWindowStartTime: "01:00"
    historyCleanupBatchWindowEndTime: "08:00"
    historyCleanupStrategy: removalTimeBased
    historyCleanupJobLogTimeToLive: "P2D"

One of the problems we faced is that we started getting locks on the DB side during execution of the delete queries, especially on ACT_GE_BYTEARRAY, the biggest table in our database. We also noticed a lot of records with REMOVAL_TIME_ in the past that were never deleted. Maybe we missed something in the cleanup or job-execution configuration?

Hi @pzadorovskyi, here are several key points to check and tune:

:white_check_mark: 1. Confirm Cleanup Jobs Are Executing Properly

Camunda cleanup jobs are triggered by a scheduled job executor. Make sure:

  • Job executor is enabled in each pod (check job-execution settings).
  • Only one node is acquiring the cleanup job—you might be seeing lock contention if multiple nodes attempt the same batch job.

Use this query to check for pending cleanup jobs:

SELECT * FROM ACT_RU_JOB WHERE HANDLER_TYPE_ = 'history-cleanup';

And for failed jobs:

SELECT * FROM ACT_RU_JOB WHERE EXCEPTION_MSG_ IS NOT NULL;
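
If you prefer the engine API over raw SQL, the same jobs can be read via ManagementService#getHistoryCleanupJobs (a minimal sketch; it assumes you can get hold of the bootstrapped ProcessEngine, e.g. injected by the Spring Boot starter):

import java.util.List;
import org.camunda.bpm.engine.ManagementService;
import org.camunda.bpm.engine.ProcessEngine;
import org.camunda.bpm.engine.runtime.Job;

public class CleanupJobCheck {

  // Lists the history cleanup jobs the engine has scheduled, with due date, retries and last error
  public static void logCleanupJobs(ProcessEngine engine) {
    ManagementService managementService = engine.getManagementService();
    List<Job> cleanupJobs = managementService.getHistoryCleanupJobs();
    for (Job job : cleanupJobs) {
      System.out.printf("cleanup job %s due=%s retries=%d exception=%s%n",
          job.getId(), job.getDuedate(), job.getRetries(), job.getExceptionMessage());
    }
  }
}

A retries count of 0 together with a non-null exception message means the cleanup job is dead; its retries need to be reset (and the underlying error fixed) before cleanup continues.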

:white_check_mark: 2. Tune Cleanup Batch Sizes and Windows

Batch Size:

  • You’re using historyCleanupBatchSize: 500, which may be too high, especially with large payloads.
  • Try reducing to 100–200 and monitor locking behavior.

Batch Window:

  • Your window runs from 01:00 to 08:00, which is reasonable, but ensure:

    • Cleanup jobs are actually running in that window.
    • The load on the DB during this time is otherwise low.

:white_check_mark: 3. ACT_GE_BYTEARRAY Locking

This table stores serialized variables and payloads, and deleting from it can be slow, especially with foreign key constraints.

Tips:

  • Ensure proper indexing on ACT_GE_BYTEARRAY and all foreign key columns.
  • Check if cascading deletes are causing large transactions; manually deleting in smaller slices may help (see the sketch after this list).
  • Consider partitioning the table (if supported by your DB).
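
If hand-written DELETE statements are what produce the long transactions, an alternative is to let the engine remove finished historic process instances in smaller asynchronous batches via the Batch API, which also deletes the dependent ACT_GE_BYTEARRAY rows while respecting the foreign keys. A minimal sketch (the cutoff date is just an example):

import java.util.Date;
import org.camunda.bpm.engine.HistoryService;
import org.camunda.bpm.engine.batch.Batch;
import org.camunda.bpm.engine.history.HistoricProcessInstanceQuery;

public class ManualHistoryDeletion {

  // Deletes finished historic process instances older than the cutoff as an async batch,
  // so each transaction only touches a small slice of the history tables
  public static Batch deleteOldHistory(HistoryService historyService, Date finishedBefore) {
    HistoricProcessInstanceQuery query = historyService.createHistoricProcessInstanceQuery()
        .finished()
        .finishedBefore(finishedBefore);
    return historyService.deleteHistoricProcessInstancesAsync(query, "manual cleanup of stale history");
  }
}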

:white_check_mark: 4. Check REMOVAL_TIME_ Propagation

Objects are only deleted once REMOVAL_TIME_ is set on all related entities (e.g., process instance, variable instance, job log). If a bytearray record isn't linked to a record with a valid REMOVAL_TIME_, it won't be cleaned up.

Run:

SELECT COUNT(*) FROM ACT_HI_PROCINST WHERE REMOVAL_TIME_ IS NULL;

Also:

SELECT COUNT(*) FROM ACT_HI_VARINST WHERE REMOVAL_TIME_ IS NULL;

If many rows are missing REMOVAL_TIME_, you can set it retroactively with a batch operation on the HistoryService:

historyService.setRemovalTimeToHistoricProcessInstances().calculatedRemovalTime()
    .byQuery(historyService.createHistoricProcessInstanceQuery().finished()).executeAsync();

Or set (in generic-properties):

properties:
  historyRemovalTimeStrategy: "end"
  enableHistoricInstancePermissions: true

:white_check_mark: 5. Check Kafka Async History Setup

When using async history:

  • Deletion can only happen after async writes are completed.
  • Ensure the Kafka history handler writes data promptly and doesn’t delay REMOVAL_TIME_ propagation (see the sketch after this list).
  • Check consumer lags and flush rates.
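
One concrete thing to verify in a custom Kafka pipeline: the HistoryEvent already carries the calculated removal time when it reaches your handler, so that value has to survive serialization and end up in the REMOVAL_TIME_ columns on the consumer side. A minimal publisher-side sketch (the topic name and the use of Spring's KafkaTemplate are assumptions, not details from your setup):

import java.util.List;
import org.camunda.bpm.engine.impl.history.event.HistoryEvent;
import org.camunda.bpm.engine.impl.history.handler.HistoryEventHandler;
import org.springframework.kafka.core.KafkaTemplate;

public class KafkaHistoryEventHandler implements HistoryEventHandler {

  private final KafkaTemplate<String, HistoryEvent> kafkaTemplate;

  public KafkaHistoryEventHandler(KafkaTemplate<String, HistoryEvent> kafkaTemplate) {
    this.kafkaTemplate = kafkaTemplate;
  }

  @Override
  public void handleEvent(HistoryEvent event) {
    // with historyRemovalTimeStrategy "start", event.getRemovalTime() is already populated here;
    // whatever the consumer writes into the history tables must persist it as REMOVAL_TIME_
    kafkaTemplate.send("camunda-history", event.getProcessInstanceId(), event);  // topic name is a placeholder
  }

  @Override
  public void handleEvents(List<HistoryEvent> events) {
    events.forEach(this::handleEvent);
  }
}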

:white_check_mark: 6. Upgrade or Patch

If possible:

  • Upgrade to Camunda 7.20.x or 7.21.x, where async history and cleanup logic have more fixes and optimizations.
  • Review relevant Camunda JIRA tickets (like CAM-12362, CAM-12547, etc.) for history cleanup issues.

:mag: Suggested Next Steps

  • Lower historyCleanupBatchSize and monitor DB locks.
  • Verify that REMOVAL_TIME_ is properly set across all historic tables.
  • Check that only one node runs cleanup jobs at a time.
  • Inspect indexes and FK constraints on ACT_GE_BYTEARRAY.
  • Monitor async history flush to DB via Kafka.

Hi @pzadorovskyi, please check below.

:white_check_mark: SQL Diagnostic Checklist for Camunda History Cleanup

These queries help diagnose why cleanup is not progressing and where locking or missing configuration might be occurring.


1. Pending Cleanup Jobs

SELECT * FROM ACT_RU_JOB WHERE HANDLER_TYPE_ = 'history-cleanup';

2. Failed Cleanup Jobs

SELECT * FROM ACT_RU_JOB WHERE HANDLER_TYPE_ = 'history-cleanup' AND EXCEPTION_MSG_ IS NOT NULL;

3. Historic Process Instances Not Marked for Deletion

SELECT COUNT(*) FROM ACT_HI_PROCINST WHERE REMOVAL_TIME_ IS NULL;

4. Variables Without Removal Time

SELECT COUNT(*) FROM ACT_HI_VARINST WHERE REMOVAL_TIME_ IS NULL;

5. Byte Arrays Not Referenced by Historic Variables

This approximates “orphaned” byte arrays that may never be cleaned up; note that ACT_GE_BYTEARRAY is also referenced by other tables (e.g., ACT_RU_VARIABLE, ACT_HI_DETAIL, ACT_HI_JOB_LOG), so rows counted here can still be in use.

SELECT COUNT(*) FROM ACT_GE_BYTEARRAY b
LEFT JOIN ACT_HI_VARINST v ON b.ID_ = v.BYTEARRAY_ID_
WHERE v.BYTEARRAY_ID_ IS NULL;

6. Historic Bytearrays Still in DB

To monitor bloat:

SELECT COUNT(*) FROM ACT_GE_BYTEARRAY;

7. History Cleanup Job Lock Contention

See how long cleanup jobs are running or waiting:

SELECT ID_, JOB_DEF_ID_, DUEDATE_, LOCK_EXP_TIME_, RETRIES_, EXCLUSIVE_, SUSPENSION_STATE_
FROM ACT_RU_JOB
WHERE HANDLER_TYPE_ = 'history-cleanup';

:hammer_and_wrench: Cleanup Tuning Script (Camunda 7.19 YAML)

Below is a tuned application.yaml block with more conservative cleanup settings. In the Spring Boot starter these engine options are all passed through generic-properties:

camunda:
  bpm:
    generic-properties:
      properties:
        historyCleanupEnabled: true
        historyCleanupBatchSize: 100            # lower to reduce DB load per cleanup transaction
        historyCleanupStrategy: removalTimeBased
        historyRemovalTimeStrategy: "start"
        historyCleanupBatchWindowStartTime: "02:00"
        historyCleanupBatchWindowEndTime: "06:00"
        historyTimeToLive: "P2D"
        batchOperationHistoryTimeToLive: "P2D"
        historyCleanupJobLogTimeToLive: "P2D"
        enableHistoricInstancePermissions: true

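If you would rather keep these settings in code than in application.yaml, the same engine options can be applied from a process engine plugin (a sketch; with the Spring Boot starter, registering the plugin as a bean is enough for it to be picked up):

import org.camunda.bpm.engine.impl.cfg.AbstractProcessEnginePlugin;
import org.camunda.bpm.engine.impl.cfg.ProcessEngineConfigurationImpl;

// applies the same conservative cleanup settings programmatically before engine init
public class HistoryCleanupTuningPlugin extends AbstractProcessEnginePlugin {

  @Override
  public void preInit(ProcessEngineConfigurationImpl config) {
    config.setHistoryCleanupBatchSize(100);
    config.setHistoryCleanupStrategy("removalTimeBased");
    config.setHistoryRemovalTimeStrategy("start");
    config.setHistoryCleanupBatchWindowStartTime("02:00");
    config.setHistoryCleanupBatchWindowEndTime("06:00");
  }
}
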
:brain: Additional Tuning Tips

  1. Run Cleanup Job Manually for Testing:
historyService.cleanUpHistoryAsync(true);   // on HistoryService; true schedules the cleanup job to run immediately
  2. Enable Logging for History Cleanup:
    To debug history cleanup:
logging.level.org.camunda.bpm.engine.impl.history.cleanup=DEBUG
  3. Tune Kafka Consumer Lag (If Applicable):
    Ensure your async Kafka consumer for history is not lagging behind. Use metrics/logs from your Kafka monitoring (Prometheus or Kafka-UI), or check it programmatically (see the sketch below).
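
If you don't have Kafka monitoring in place yet, consumer lag can also be checked with the Kafka AdminClient (a minimal sketch; the bootstrap server and the consumer group id camunda-history-writer are placeholders, not names from your setup):

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class HistoryConsumerLagCheck {

  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
    try (AdminClient admin = AdminClient.create(props)) {
      // offsets the history consumer group has committed so far (group id is a placeholder)
      Map<TopicPartition, OffsetAndMetadata> committed =
          admin.listConsumerGroupOffsets("camunda-history-writer")
               .partitionsToOffsetAndMetadata().get();
      // latest available offsets of the same partitions
      Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
          admin.listOffsets(committed.keySet().stream()
              .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()))).all().get();
      // lag per partition = latest offset minus committed offset
      committed.forEach((tp, offset) ->
          System.out.printf("%s lag=%d%n", tp, latest.get(tp).offset() - offset.offset()));
    }
  }
}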