We’re running the Docker distribution of Camunda 7.11, and after noticing our DB storage usage rising beyond our comfort level, we started investigating how to clean up historic data.
First I discovered that none of our workflows had a TTL set, so I’ve updated those, and process instances created since then are getting the removal time filled in. It didn’t seem like the cleanup job was actually being run, though.
I also discovered that the default configuration of the Docker distribution didn’t seem to run any jobs. I’ve created batch jobs in the past and they never seemed to run, but I wasn’t sure why. My current cleanup investigation found this page, which notes that embedded process engines don’t run jobs by default. It’s not explicit, but I’m assuming the Docker distribution counts as “embedded”? So I updated our deployment to use a custom bpm-platform.xml which adds <property name="jobExecutorActivate">true</property> and also adds a default TTL, but this didn’t seem to help.
I’ve also manually backfilled the removal_time_ column in all of the history tables (act_hi_*) in our database, but that doesn’t seem to be having any effect either.
Jobs still don’t seem to execute, and the “due date” of the history cleanup job keeps updating to a point just a minute or so in the future. I tried setting the due date of the cleanup job to the past, but it didn’t seem to do anything, and it was quickly pushed back into the future. Is there any way to verify that jobs are actually running? I’ve checked the logs of the Camunda container, and there are JobExecutor log lines saying it’s “starting to acquire jobs”, but it never seems to make progress on any of them.
Sorry for the double post, I posted the first one before I was finished (a mistaken ctrl+enter) and didn’t want replies to a half-written post.
Hi Megan - with regards to the history cleanup, there are two strategies, as outlined here: https://docs.camunda.org/manual/7.11/user-guide/process-engine/history/#history-cleanup. They are “end-time based” and “removal-time based”. Removal-time based is the default if no strategy is configured, but it does require a TTL to be configured in order for Camunda to set the removal time attribute in the DB when an instance completes, so it makes sense that only new instances are getting this attribute set. However, you can set the TTL on old process definitions via the REST API (https://docs.camunda.org/manual/7.11/reference/rest/process-definition/put-history-time-to-live/). In general, I would suggest avoiding setting the removal time attribute directly.
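For reference, a call to that endpoint looks roughly like this (the {id} path parameter and the 30-day value are placeholders):

```
PUT /engine-rest/process-definition/{id}/history-time-to-live
Content-Type: application/json

{ "historyTimeToLive": 30 }
```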
Do you have a “cleanup window” configured? This also needs to be set to tell Camunda when to run the batch cleanup, and it’s advisable to schedule it during off-hours to avoid adding load to the system during business hours (if that’s a concern for you).
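For example, the window is set via engine properties in bpm-platform.xml (the times here are placeholders; pick whatever off-hours window suits you):

```xml
<properties>
  <!-- run history cleanup between 22:00 and 06:00 -->
  <property name="historyCleanupBatchWindowStartTime">22:00</property>
  <property name="historyCleanupBatchWindowEndTime">06:00</property>
</properties>
```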
As for the job executor: do you see any errors in the logs? Can you provide some more details on the types of jobs in your processes (are you using timers?) and how these process definitions are deployed?
I tried using that API to set TTL, but the REST call would time out before completing and didn’t seem to have any effect either. I forgot to mention that in my initial post.
I’m currently using (nearly) the default cleanup window. The default bpm-platform.xml sets the window at 00:01-00:00 the next day. I’ve only changed that to 00:02-00:00 to verify that my changed configuration was being applied. So the cleanup window should be nearly 24/7.
Our process definitions do not have any timers. They’re a series of external tasks executed by a service deployed alongside Camunda in a Kubernetes cluster. We have multiple containers with an identical (and default) configuration, plus a single Camunda container which our service does not interact with but which hosts the UI, due to an issue with login sessions in Cockpit. Only that single UI container has the non-default configuration I’ve mentioned.
I don’t see any errors in the logs, and in fact nothing at all in the logs after the initial service startup.
It is possible, if you have a large number of instances in your DB, that the REST call will time out before it is able to update all of them with the new TTL. If so, your best (only?) option may be to switch to the “end-time” based strategy to clean up the old data first, before switching back to “removal-time” based.
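Switching strategies is a one-line engine property change (shown here in bpm-platform.xml, assuming defaults elsewhere):

```xml
<property name="historyCleanupStrategy">endTimeBased</property>
```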
Have you tried changing the window to 00:01 - 23:59? The docs show that 00:00 is a default value, and I’m not sure the engine will recognize your configuration as extending into the following day.
So if I understand, when one of your processes is started, the token gets stuck on the external task because the job executor never picks it up? Can you share your application.properties (or .yaml) file as well as your bpm-platform.xml file and its filepath? Have you confirmed it is being copied properly into the Docker container? Also, what database and version are you using? See this related thread: Activated JobExecutor, but still not active/working while using MariaDB (Docker)
I’ll try switching to the “end-time” strategy and updating the window to 00:01-23:59 as you’ve suggested. Those changes make sense with your explanation.
Our processes actually run fine. The external tasks run and the processes complete as expected. I have also confirmed that the bpm-platform.xml file is being copied into the docker container, as the cleanup job schedule was correctly displaying the change in start time from 00:01 to 00:02, as I mentioned earlier.
We’re using Postgres version 10 with Camunda version 7.11.
Unfortunately that didn’t work either. Here is the full bpm-platform.xml we’re using, in case that helps. Additionally the following environment variables are set: DB_DRIVER, DB_CONN_MAXACTIVE, DB_VALIDATE_ON_BORROW, DB_VALIDATION_QUERY, DB_URL, DB_USERNAME, DB_PASSWORD. Our deployment of Camunda seems to work fine in all respects except background jobs like this just don’t ever seem to run or make any progress.
<?xml version="1.0" encoding="UTF-8"?>
<bpm-platform xmlns="http://www.camunda.org/schema/1.0/BpmPlatform" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.camunda.org/schema/1.0/BpmPlatform http://www.camunda.org/schema/1.0/BpmPlatform ">
<job-acquisition name="default" />
<!-- plugin enabling Process Application event listener support -->
<!-- plugin enabling integration of camunda Spin -->
<!-- plugin enabling connect support -->
How are your processes deployed? The only other thing I can think of is that perhaps some of the older processes were not “registered” with the engine. You have jobExecutorDeploymentAware set to true, which means the executor will only pick up jobs that are registered with the engine. If you set it to false, it should pick up all jobs. See this thread for more details: Deployment-Aware Job Executor
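If you want to try it, the property lives alongside the others in bpm-platform.xml:

```xml
<property name="jobExecutorDeploymentAware">false</property>
```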
I made that change and it didn’t seem to make any difference, so I dug deeper into our DB. Based on the contents of act_ru_job, the job is running, but the runs typically don’t have anything to delete. I wondered how that could be true, so I inspected some of our tables. It turns out the job has been cleaning up as I would like, but only on some of the tables. The table act_hi_procinst only has about a month of data, as expected (some rows a bit older due to long-running processes), but the tables act_hi_detail, act_hi_actinst, act_hi_varinst, and act_hi_ext_task_log all seem to have data from process instances which are no longer in act_hi_procinst. Unfortunately, those four tables are also responsible for the vast majority of the storage used by our database. I have to assume this isn’t expected behavior. Am I going to have to manually delete this old data? Shouldn’t the history cleanup have deleted it along with the process instances?
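For anyone wanting to reproduce this kind of check, it amounts to an anti-join between a detail table and act_hi_procinst. A minimal, self-contained sketch of the idea using Python’s sqlite3 with toy two-column stand-ins for the real Camunda tables (which have many more columns):

```python
import sqlite3

# Toy stand-ins for the Camunda history tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE act_hi_procinst (proc_inst_id_ TEXT PRIMARY KEY);
    CREATE TABLE act_hi_actinst  (id_ TEXT PRIMARY KEY, proc_inst_id_ TEXT);

    -- one live process instance, plus activity rows from an instance
    -- that has already been cleaned out of act_hi_procinst
    INSERT INTO act_hi_procinst VALUES ('p1');
    INSERT INTO act_hi_actinst VALUES ('a1', 'p1'), ('a2', 'p-deleted'), ('a3', 'p-deleted');
""")

# Anti-join: activity rows whose process instance no longer exists.
orphans = con.execute("""
    SELECT a.id_, a.proc_inst_id_
    FROM act_hi_actinst a
    LEFT JOIN act_hi_procinst p ON p.proc_inst_id_ = a.proc_inst_id_
    WHERE p.proc_inst_id_ IS NULL
""").fetchall()

print(orphans)  # the two rows belonging to 'p-deleted' are orphaned
```

The same query shape, pointed at the real Postgres schema, reveals how many rows in each act_hi_* table no longer have a parent in act_hi_procinst.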
Also, I’d like to thank you for the help. Even as the resident Camunda expert on my team, I clearly have a lot to learn.
Happy to help - the history cleanup can definitely be tricky. You’re right, that is not expected behavior. My concern is that some of the instance data may have been cleaned up when you manually set the REMOVAL_TIME_ attribute directly while cleanup was set to the “removal-time” based strategy. With the “end-time” based strategy, the cleanup depends on pulling the end time from the ACT_HI_PROCINST_ table first before removing data in related tables. (This is why it takes slightly longer than the “removal-time” based strategy, as stated in the docs.) So if only the data in the instance table somehow got cleaned up, but the rest of the data is still there, I’m not sure the “end-time” based strategy will work…
While it is not ideal (or recommended), you may now have to manually set the REMOVAL_TIME_ and change the strategy back to “removal-time” based to clean up the rest of it.
The expected behavior for “removal-time” based cleanup is that data from any history (act_hi_*) table with a REMOVAL_TIME_ earlier than the current time (when the job runs) will get removed, regardless of the instance hierarchy. The cleanup assumes the REMOVAL_TIME_ was set because a TTL was configured, in which case Camunda automatically sets the REMOVAL_TIME_ on all tables associated with that instance, so there is no need to worry about the hierarchy.
The expected behavior for “end-time” based cleanup is that Camunda first queries the ACT_HI_PROCINST_ table for instances whose end time is older than the configured history TTL allows. It then joins across all the history tables by process instance id to clean up all data associated with those instances.
Rather than switch to the “removal-time” strategy and then back to “end-time”, I chose to just manually remove the orphaned data. It seemed like a more direct approach, and since I had already manually modified the data, it seemed more likely to work than hoping Camunda could clean up after the mess I made. The “end-time” strategy seems to be working now that that’s done; our DB growth has stopped.
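For the record, the orphan removal amounts to a DELETE with the same anti-join shape, repeated once per detail table. Sketched here against sqlite3 with toy stand-in tables; if you try something similar on a real database, test it against a backup first:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE act_hi_procinst (proc_inst_id_ TEXT PRIMARY KEY);
    CREATE TABLE act_hi_varinst  (id_ TEXT PRIMARY KEY, proc_inst_id_ TEXT);
    INSERT INTO act_hi_procinst VALUES ('p1');
    INSERT INTO act_hi_varinst VALUES ('v1', 'p1'), ('v2', 'gone'), ('v3', 'gone');
""")

# Remove variable rows whose owning process instance is no longer present.
# On the real schema the same statement would be repeated for act_hi_detail,
# act_hi_actinst, act_hi_ext_task_log, and so on.
cur = con.execute("""
    DELETE FROM act_hi_varinst
    WHERE proc_inst_id_ NOT IN (SELECT proc_inst_id_ FROM act_hi_procinst)
""")
con.commit()
print(cur.rowcount)  # → 2: the rows for the vanished instance are gone
```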
Additionally, one thing I wasn’t initially aware of, but which definitely led to confusion, is how PostgreSQL handles table growth. Once it claims space from the OS and allocates it to a table, deleting rows doesn’t return that space to the OS: a plain VACUUM (or autovacuum) only marks it as reusable for future rows, and only a VACUUM FULL actually shrinks the table on disk. In retrospect, the cleanup job was working sooner than I thought; I was just reading the DB metrics and interpreting them incorrectly.
Thanks again for all of your help. I’ve marked as a solution the comment that (in retrospect) got our cleanup job working properly.