Locks on db on cleanup job execution (multi pod deployment)

Hi Camunda community. Our team uses Camunda 7.19.0 in a multi-pod deployment (8 pods on prod). Our system puts a heavy load on the DB, which is why we decided to switch to async history writes using Kafka. We have also configured TTLs for the history cleanup jobs:

generic-properties:
  properties:
    historyTimeToLive: "P2D"
    historyCleanupEnabled: true
    historyCleanupBatchSize: 500
    historyRemovalTimeStrategy: "start"
    batchOperationHistoryTimeToLive: "P2D"
    historyCleanupBatchWindowStartTime: "01:00"
    historyCleanupBatchWindowEndTime: "08:00"
    historyCleanupStrategy: removalTimeBased
    historyCleanupJobLogTimeToLive: "P2D"

One of the problems we faced is that we started getting locks on the DB side during execution of the delete queries, especially on ACT_GE_BYTEARRAY, the biggest table in our database. We also noticed a lot of records with REMOVAL_TIME_ in the past that were never deleted. Maybe we missed something in the cleanup or job-execution configuration?

Hi @pzadorovskyi, here are several key points to check and tune:

:white_check_mark: 1. Confirm Cleanup Jobs Are Executing Properly

Camunda cleanup jobs are triggered by a scheduled job executor. Make sure:

  • Job executor is enabled in each pod (check job-execution settings).
  • Only one node is acquiring the cleanup job—you might be seeing lock contention if multiple nodes attempt the same batch job.

Use this query to check for pending cleanup jobs:

SELECT * FROM ACT_RU_JOB WHERE HANDLER_TYPE_ = 'history-cleanup';

And for failed jobs:

SELECT * FROM ACT_RU_JOB WHERE EXCEPTION_MSG_ IS NOT NULL;
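
If you prefer the engine API over raw SQL, the same jobs can be read via ManagementService#getHistoryCleanupJobs (a minimal sketch; it assumes you can get hold of the bootstrapped ProcessEngine, e.g. injected by the Spring Boot starter):

import java.util.List;
import org.camunda.bpm.engine.ManagementService;
import org.camunda.bpm.engine.ProcessEngine;
import org.camunda.bpm.engine.runtime.Job;

public class CleanupJobCheck {

  // Lists the history cleanup jobs the engine has scheduled, with due date, retries and last error
  public static void logCleanupJobs(ProcessEngine engine) {
    ManagementService managementService = engine.getManagementService();
    List<Job> cleanupJobs = managementService.getHistoryCleanupJobs();
    for (Job job : cleanupJobs) {
      System.out.printf("cleanup job %s due=%s retries=%d exception=%s%n",
          job.getId(), job.getDuedate(), job.getRetries(), job.getExceptionMessage());
    }
  }
}

A retries count of 0 together with a non-null exception message means the cleanup job is dead; its retries need to be reset (and the underlying error fixed) before cleanup continues.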

:white_check_mark: 2. Tune Cleanup Batch Sizes and Windows

Batch Size:

  • You’re using historyCleanupBatchSize: 500, which may be too high, especially with large payloads.
  • Try reducing to 100–200 and monitor locking behavior.

Batch Window:

  • Your window runs from 01:00 to 08:00, which is reasonable, but ensure:

    • Cleanup jobs are actually running in that window.
    • The load on the DB during this time is otherwise low.

:white_check_mark: 3. ACT_GE_BYTEARRAY Locking

This table stores serialized variables and payloads, and deleting from it can be slow, especially with foreign key constraints.

Tips:

  • Ensure proper indexing on ACT_GE_BYTEARRAY and all foreign key columns.
  • Check if cascading deletes are causing large transactions; manually deleting in smaller slices may help (see the sketch after this list).
  • Consider partitioning the table (if supported by your DB).
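
If hand-written DELETE statements are what produce the long transactions, an alternative is to let the engine remove finished historic process instances in smaller asynchronous batches via the Batch API, which also deletes the dependent ACT_GE_BYTEARRAY rows while respecting the foreign keys. A minimal sketch (the cutoff date is just an example):

import java.util.Date;
import org.camunda.bpm.engine.HistoryService;
import org.camunda.bpm.engine.batch.Batch;
import org.camunda.bpm.engine.history.HistoricProcessInstanceQuery;

public class ManualHistoryDeletion {

  // Deletes finished historic process instances older than the cutoff as an async batch,
  // so each transaction only touches a small slice of the history tables
  public static Batch deleteOldHistory(HistoryService historyService, Date finishedBefore) {
    HistoricProcessInstanceQuery query = historyService.createHistoricProcessInstanceQuery()
        .finished()
        .finishedBefore(finishedBefore);
    return historyService.deleteHistoricProcessInstancesAsync(query, "manual cleanup of stale history");
  }
}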

:white_check_mark: 4. Check REMOVAL_TIME_ Propagation

Objects are only deleted once REMOVAL_TIME_ is set on all related entities (e.g., process instance, variable instance, job log). If a bytearray record isn't linked to a record with a valid REMOVAL_TIME_, it won't be cleaned up.

Run:

SELECT COUNT(*) FROM ACT_HI_PROCINST WHERE REMOVAL_TIME_ IS NULL;

Also:

SELECT COUNT(*) FROM ACT_HI_VARINST WHERE REMOVAL_TIME_ IS NULL;

If many rows are missing REMOVAL_TIME_, you can set it retroactively with a batch operation on the HistoryService:

historyService.setRemovalTimeToHistoricProcessInstances().calculatedRemovalTime()
    .byQuery(historyService.createHistoricProcessInstanceQuery().finished()).executeAsync();

Or set (in generic-properties):

properties:
  historyRemovalTimeStrategy: "end"
  enableHistoricInstancePermissions: true

:white_check_mark: 5. Check Kafka Async History Setup

When using async history:

  • Deletion can only happen after async writes are completed.
  • Ensure the Kafka history handler writes data promptly and doesn’t delay REMOVAL_TIME_ propagation (see the sketch after this list).
  • Check consumer lags and flush rates.
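
One concrete thing to verify in a custom Kafka pipeline: the HistoryEvent already carries the calculated removal time when it reaches your handler, so that value has to survive serialization and end up in the REMOVAL_TIME_ columns on the consumer side. A minimal publisher-side sketch (the topic name and the use of Spring's KafkaTemplate are assumptions, not details from your setup):

import java.util.List;
import org.camunda.bpm.engine.impl.history.event.HistoryEvent;
import org.camunda.bpm.engine.impl.history.handler.HistoryEventHandler;
import org.springframework.kafka.core.KafkaTemplate;

public class KafkaHistoryEventHandler implements HistoryEventHandler {

  private final KafkaTemplate<String, HistoryEvent> kafkaTemplate;

  public KafkaHistoryEventHandler(KafkaTemplate<String, HistoryEvent> kafkaTemplate) {
    this.kafkaTemplate = kafkaTemplate;
  }

  @Override
  public void handleEvent(HistoryEvent event) {
    // with historyRemovalTimeStrategy "start", event.getRemovalTime() is already populated here;
    // whatever the consumer writes into the history tables must persist it as REMOVAL_TIME_
    kafkaTemplate.send("camunda-history", event.getProcessInstanceId(), event);  // topic name is a placeholder
  }

  @Override
  public void handleEvents(List<HistoryEvent> events) {
    events.forEach(this::handleEvent);
  }
}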

:white_check_mark: 6. Upgrade or Patch

If possible:

  • Upgrade to Camunda 7.20.x or 7.21.x, where async history and cleanup logic have more fixes and optimizations.
  • Review relevant Camunda JIRA tickets (like CAM-12362, CAM-12547, etc.) for history cleanup issues.

:mag: Suggested Next Steps

  • Lower historyCleanupBatchSize and monitor DB locks.
  • Verify that REMOVAL_TIME_ is properly set across all historic tables.
  • Check that only one node runs cleanup jobs at a time.
  • Inspect indexes and FK constraints on ACT_GE_BYTEARRAY.
  • Monitor async history flush to DB via Kafka.

Hi @pzadorovskyi, please check below.

:white_check_mark: SQL Diagnostic Checklist for Camunda History Cleanup

These queries help diagnose why cleanup is not progressing and where locking or missing configuration might be occurring.


1. Pending Cleanup Jobs

SELECT * FROM ACT_RU_JOB WHERE HANDLER_TYPE_ = 'history-cleanup';

2. Failed Cleanup Jobs

SELECT * FROM ACT_RU_JOB WHERE HANDLER_TYPE_ = 'history-cleanup' AND EXCEPTION_MSG_ IS NOT NULL;

3. Historic Process Instances Not Marked for Deletion

SELECT COUNT(*) FROM ACT_HI_PROCINST WHERE REMOVAL_TIME_ IS NULL;

4. Variables Without Removal Time

SELECT COUNT(*) FROM ACT_HI_VARINST WHERE REMOVAL_TIME_ IS NULL;

5. Byte Arrays Not Referenced by Historic Variables

This approximates “orphaned” byte arrays that may never be cleaned up; note that ACT_GE_BYTEARRAY is also referenced by other tables (e.g., ACT_RU_VARIABLE, ACT_HI_DETAIL, ACT_HI_JOB_LOG), so rows counted here can still be in use.

SELECT COUNT(*) FROM ACT_GE_BYTEARRAY b
LEFT JOIN ACT_HI_VARINST v ON b.ID_ = v.BYTEARRAY_ID_
WHERE v.BYTEARRAY_ID_ IS NULL;

6. Historic Bytearrays Still in DB

To monitor bloat:

SELECT COUNT(*) FROM ACT_GE_BYTEARRAY;

7. History Cleanup Job Lock Contention

See how long cleanup jobs are running or waiting:

SELECT ID_, JOB_DEF_ID_, DUEDATE_, LOCK_EXP_TIME_, RETRIES_, EXCLUSIVE_, SUSPENSION_STATE_
FROM ACT_RU_JOB
WHERE HANDLER_TYPE_ = 'history-cleanup';

:hammer_and_wrench: Cleanup Tuning Script (Camunda 7.19 YAML)

Below is a tuned application.yaml block with more conservative cleanup settings. In the Spring Boot starter these engine options are all passed through generic-properties:

camunda:
  bpm:
    generic-properties:
      properties:
        historyCleanupEnabled: true
        historyCleanupBatchSize: 100            # lower to reduce DB load per cleanup transaction
        historyCleanupStrategy: removalTimeBased
        historyRemovalTimeStrategy: "start"
        historyCleanupBatchWindowStartTime: "02:00"
        historyCleanupBatchWindowEndTime: "06:00"
        historyTimeToLive: "P2D"
        batchOperationHistoryTimeToLive: "P2D"
        historyCleanupJobLogTimeToLive: "P2D"
        enableHistoricInstancePermissions: true

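If you would rather keep these settings in code than in application.yaml, the same engine options can be applied from a process engine plugin (a sketch; with the Spring Boot starter, registering the plugin as a bean is enough for it to be picked up):

import org.camunda.bpm.engine.impl.cfg.AbstractProcessEnginePlugin;
import org.camunda.bpm.engine.impl.cfg.ProcessEngineConfigurationImpl;

// applies the same conservative cleanup settings programmatically before engine init
public class HistoryCleanupTuningPlugin extends AbstractProcessEnginePlugin {

  @Override
  public void preInit(ProcessEngineConfigurationImpl config) {
    config.setHistoryCleanupBatchSize(100);
    config.setHistoryCleanupStrategy("removalTimeBased");
    config.setHistoryRemovalTimeStrategy("start");
    config.setHistoryCleanupBatchWindowStartTime("02:00");
    config.setHistoryCleanupBatchWindowEndTime("06:00");
  }
}
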
:brain: Additional Tuning Tips

  1. Run Cleanup Job Manually for Testing:
historyService.cleanUpHistoryAsync(true);   // on HistoryService; true schedules the cleanup job to run immediately
  2. Enable Logging for History Cleanup:
    To debug history cleanup:
logging.level.org.camunda.bpm.engine.impl.history.cleanup=DEBUG
  3. Tune Kafka Consumer Lag (If Applicable):
    Ensure your async Kafka consumer for history is not lagging behind. Use metrics/logs from your Kafka monitoring (Prometheus or Kafka-UI), or check it programmatically (see the sketch below).
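
If you don't have Kafka monitoring in place yet, consumer lag can also be checked with the Kafka AdminClient (a minimal sketch; the bootstrap server and the consumer group id camunda-history-writer are placeholders, not names from your setup):

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class HistoryConsumerLagCheck {

  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
    try (AdminClient admin = AdminClient.create(props)) {
      // offsets the history consumer group has committed so far (group id is a placeholder)
      Map<TopicPartition, OffsetAndMetadata> committed =
          admin.listConsumerGroupOffsets("camunda-history-writer")
               .partitionsToOffsetAndMetadata().get();
      // latest available offsets of the same partitions
      Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
          admin.listOffsets(committed.keySet().stream()
              .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()))).all().get();
      // lag per partition = latest offset minus committed offset
      committed.forEach((tp, offset) ->
          System.out.printf("%s lag=%d%n", tp, latest.get(tp).offset() - offset.offset()));
    }
  }
}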