History cleanup: java.sql.SQLException: Lock wait timeout exceeded

Hi,

We are using v7.7.0, and because we ran the engine with history level "full" for a while, the history tables grew very big. It has been changed to "audit" now and I am trying to get rid of some old data. I called the cleanup job, but there are some problems.
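This is roughly how the cleanup was triggered, sketched with the engine's Java API (a minimal sketch, assuming access to the ProcessEngine; the REST call POST /history/cleanup is equivalent):

    import org.camunda.bpm.engine.ProcessEngine;
    import org.camunda.bpm.engine.ProcessEngines;
    import org.camunda.bpm.engine.runtime.Job;

    // Schedules the history cleanup job; "true" makes it due immediately
    // instead of waiting for the configured batch window.
    ProcessEngine engine = ProcessEngines.getDefaultProcessEngine();
    Job cleanupJob = engine.getHistoryService().cleanUpHistoryAsync(true);

Afterwards, the following shows up in the log: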

02-Nov-2017 21:56:54.957 SEVERE [Thread-4] org.camunda.commons.logging.BaseLogger.logError ENGINE-14019 Exception during job acquisition ENGINE-03004 Exception while executing Database Operation 'UPDATE EverLivingJobEntity[f4313116-b4c9-11e7-8f8c-dea7db31ad34]' with message '
### Error updating database.  Cause: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
### The error may involve org.camunda.bpm.engine.impl.persistence.entity.JobEntity.updateEverLivingJob-Inline
### The error occurred while setting parameters
### SQL: update ACT_RU_JOB          SET REV_ =  ?,             EXECUTION_ID_ = ?,             LOCK_EXP_TIME_ = ?,             LOCK_OWNER_ = ?,             RETRIES_ = ?,             EXCEPTION_STACK_ID_ = ?,             EXCEPTION_MSG_ = ?,             DUEDATE_ = ?,             SUSPENSION_STATE_ = ?,             PROCESS_DEF_ID_ = ?,             PROCESS_DEF_KEY_ = ?,             JOB_DEF_ID_ = ?,             DEPLOYMENT_ID_ = ?,             HANDLER_CFG_ = ?,             PRIORITY_ = ?,             SEQUENCE_COUNTER_ = ?          where ID_= ?         and REV_ = ?
### Cause: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction'. Flush summary:
 [
  UPDATE EverLivingJobEntity[f4313116-b4c9-11e7-8f8c-dea7db31ad34]
]
 org.camunda.bpm.engine.ProcessEngineException: ENGINE-03004 Exception while executing Database Operation 'UPDATE EverLivingJobEntity[f4313116-b4c9-11e7-8f8c-dea7db31ad34]' with message '
### Error updating database.  Cause: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
### The error may involve org.camunda.bpm.engine.impl.persistence.entity.JobEntity.updateEverLivingJob-Inline
### The error occurred while setting parameters
### SQL: update ACT_RU_JOB          SET REV_ =  ?,             EXECUTION_ID_ = ?,             LOCK_EXP_TIME_ = ?,             LOCK_OWNER_ = ?,             RETRIES_ = ?,             EXCEPTION_STACK_ID_ = ?,             EXCEPTION_MSG_ = ?,             DUEDATE_ = ?,             SUSPENSION_STATE_ = ?,             PROCESS_DEF_ID_ = ?,             PROCESS_DEF_KEY_ = ?,             JOB_DEF_ID_ = ?,             DEPLOYMENT_ID_ = ?,             HANDLER_CFG_ = ?,             PRIORITY_ = ?,             SEQUENCE_COUNTER_ = ?          where ID_= ?         and REV_ = ?
### Cause: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction'. Flush summary:
 [
  UPDATE EverLivingJobEntity[f4313116-b4c9-11e7-8f8c-dea7db31ad34]
]
        at org.camunda.bpm.engine.impl.db.EnginePersistenceLogger.flushDbOperationException(EnginePersistenceLogger.java:115)
        at org.camunda.bpm.engine.impl.db.entitymanager.DbEntityManager.flushDbOperationManager(DbEntityManager.java:305)
        at org.camunda.bpm.engine.impl.db.entitymanager.DbEntityManager.flush(DbEntityManager.java:281)
        at org.camunda.bpm.engine.impl.interceptor.CommandContext.flushSessions(CommandContext.java:203)
        at org.camunda.bpm.engine.impl.interceptor.CommandContext.close(CommandContext.java:132)
        at org.camunda.bpm.engine.impl.interceptor.CommandContextInterceptor.execute(CommandContextInterceptor.java:113)
        at org.camunda.bpm.engine.impl.interceptor.ProcessApplicationContextInterceptor.execute(ProcessApplicationContextInterceptor.java:66)
        at org.camunda.bpm.engine.impl.interceptor.LogInterceptor.execute(LogInterceptor.java:30)
        at org.camunda.bpm.engine.impl.jobexecutor.SequentialJobAcquisitionRunnable.acquireJobs(SequentialJobAcquisitionRunnable.java:142)
        at org.camunda.bpm.engine.impl.jobexecutor.SequentialJobAcquisitionRunnable.run(SequentialJobAcquisitionRunnable.java:60)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.ibatis.exceptions.PersistenceException:
### Error updating database.  Cause: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
### The error may involve org.camunda.bpm.engine.impl.persistence.entity.JobEntity.updateEverLivingJob-Inline
### The error occurred while setting parameters
### SQL: update ACT_RU_JOB          SET REV_ =  ?,             EXECUTION_ID_ = ?,             LOCK_EXP_TIME_ = ?,             LOCK_OWNER_ = ?,             RETRIES_ = ?,             EXCEPTION_STACK_ID_ = ?,             EXCEPTION_MSG_ = ?,             DUEDATE_ = ?,             SUSPENSION_STATE_ = ?,             PROCESS_DEF_ID_ = ?,             PROCESS_DEF_KEY_ = ?,             JOB_DEF_ID_ = ?,             DEPLOYMENT_ID_ = ?,             HANDLER_CFG_ = ?,             PRIORITY_ = ?,             SEQUENCE_COUNTER_ = ?          where ID_= ?         and REV_ = ?
### Cause: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
        at org.apache.ibatis.exceptions.ExceptionFactory.wrapException(ExceptionFactory.java:26)
        at org.apache.ibatis.session.defaults.DefaultSqlSession.update(DefaultSqlSession.java:154)
        at org.camunda.bpm.engine.impl.db.sql.DbSqlSession.executeUpdate(DbSqlSession.java:239)
        at org.camunda.bpm.engine.impl.db.sql.DbSqlSession.updateEntity(DbSqlSession.java:218)
        at org.camunda.bpm.engine.impl.db.AbstractPersistenceSession.executeDbOperation(AbstractPersistenceSession.java:52)
        at org.camunda.bpm.engine.impl.db.entitymanager.DbEntityManager.flushDbOperationManager(DbEntityManager.java:303)
        ... 9 more
Caused by: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1074)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4074)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4006)
        at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2468)
        at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2629)
        at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2719)
        at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2155)
        at com.mysql.jdbc.PreparedStatement.execute(PreparedStatement.java:1379)
        at org.apache.ibatis.executor.statement.PreparedStatementHandler.update(PreparedStatementHandler.java:44)
        at org.apache.ibatis.executor.statement.RoutingStatementHandler.update(RoutingStatementHandler.java:69)
        at org.apache.ibatis.executor.SimpleExecutor.doUpdate(SimpleExecutor.java:48)
        at org.apache.ibatis.executor.BaseExecutor.update(BaseExecutor.java:105)
        at org.apache.ibatis.executor.CachingExecutor.update(CachingExecutor.java:71)
        at org.apache.ibatis.session.defaults.DefaultSqlSession.update(DefaultSqlSession.java:152)
        ... 13 more

So I am not sure if that worked or how to solve that.

Thx,

-Tobias

Hi Tobias,

What is your setup? Do you run the platform in a cluster? If so, is it homogeneous or heterogeneous?

Best regards,
Yana

No cluster setup here.

Thx,

-Tobias

Hello Tobias,

Which MySQL version do you use, and what is your configured isolation level?
Also, are you able to reproduce the issue, or did it appear only once?
If you can reproduce it: what are the steps, and is it possible to check what is holding the lock?

Best regards,
Yana

Hi Yana,

We are using MySQL 5.7.17 and the isolation level is READ COMMITTED.

So far I am not able to reproduce it; calling another cleanup gives us no errors in the log file.

The thing is, I am not sure whether the cleanup was executed or not…

Thx,

-Tobias

Hi @tobimat80,

you should see the traces in the historic job log table (ACT_HI_JOB_LOG).
Also, if metrics are enabled, you should have records in the metrics table (ACT_RU_METER_LOG).

Both of them are accessible via the REST API.
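For reference, a minimal sketch of the job log check via the Java API (assuming access to the ProcessEngine; GET /history/job-log and GET /metrics/{name}/sum are the REST equivalents):

    import java.util.List;
    import org.camunda.bpm.engine.ProcessEngine;
    import org.camunda.bpm.engine.ProcessEngines;
    import org.camunda.bpm.engine.history.HistoricJobLog;

    // Fetches all historic job log entries for the cleanup job id from the
    // exception above; the creation/success/failure flags show what happened.
    ProcessEngine engine = ProcessEngines.getDefaultProcessEngine();
    List<HistoricJobLog> logs = engine.getHistoryService()
        .createHistoricJobLogQuery()
        .jobId("f4313116-b4c9-11e7-8f8c-dea7db31ad34")
        .list();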

Thanks for the hint. But it seems it is not logged in ACT_HI_JOB_LOG:

{
  "type": "InvalidRequestException",
  "message": "Historic job log with id f4313116-b4c9-11e7-8f8c-dea7db31ad34 does not exist"
}

And in ACT_RU_METER_LOG I was not successful either.

Are you using Camunda 7.7?

Are you calling POST /history/cleanup to run history cleanup?

Can you post the response from that call here?

Yes, we upgraded a couple of weeks ago.

Here is the response:

{
  "id": "7147401",
  "jobDefinitionId": null,
  "processInstanceId": null,
  "processDefinitionId": null,
  "processDefinitionKey": null,
  "executionId": null,
  "exceptionMessage": null,
  "retries": 3,
  "dueDate": "2017-11-07T10:40:42",
  "suspended": false,
  "priority": 0,
  "tenantId": null
}

Best,

-Tobias

Do you have job execution enabled?
How did you query the ACT_HI_JOB_LOG table? Why do you have a UUID-style job id in the response, even though your jobId is 7147401?

Can you try POST /history/job-log with payload { "jobId": "7147401" }?

Also, can you check the ACT_RU_JOB table now for HANDLER_TYPE_ = "history-cleanup"? Can you post the whole DB record here?
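For reference, a minimal sketch of looking that job up via the Java API, assuming a 7.7 engine where HistoryService#findHistoryCleanupJob is available:

    import org.camunda.bpm.engine.ProcessEngine;
    import org.camunda.bpm.engine.ProcessEngines;
    import org.camunda.bpm.engine.runtime.Job;

    // Returns the job with HANDLER_TYPE_ = "history-cleanup" from ACT_RU_JOB,
    // or null if history cleanup has never been scheduled.
    ProcessEngine engine = ProcessEngines.getDefaultProcessEngine();
    Job cleanupJob = engine.getHistoryService().findHistoryCleanupJob();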

Sorry, my bad, I was on the wrong system. Here is the correct response:

{
  "id": "f4313116-b4c9-11e7-8f8c-dea7db31ad34",
  "jobDefinitionId": null,
  "processInstanceId": null,
  "processDefinitionId": null,
  "processDefinitionKey": null,
  "executionId": null,
  "exceptionMessage": null,
  "retries": 3,
  "dueDate": null,
  "suspended": true,
  "priority": 0,
  "tenantId": null
}

Job execution is enabled. Here is the job history log (now there is actually some data):

[
  {
    "id": "f4313117-b4c9-11e7-8f8c-dea7db31ad34",
    "timestamp": "2017-10-19T12:35:13",
    "jobId": "f4313116-b4c9-11e7-8f8c-dea7db31ad34",
    "jobDueDate": "2017-10-19T12:35:13",
    "jobRetries": 3,
    "jobPriority": 0,
    "jobExceptionMessage": null,
    "jobDefinitionId": null,
    "jobDefinitionType": "history-cleanup",
    "jobDefinitionConfiguration": null,
    "activityId": null,
    "executionId": null,
    "processInstanceId": null,
    "processDefinitionId": null,
    "processDefinitionKey": null,
    "deploymentId": null,
    "tenantId": null,
    "creationLog": true,
    "failureLog": false,
    "successLog": false,
    "deletionLog": false
  },
  {
    "id": "f48e6c5d-b4c9-11e7-8f8c-dea7db31ad34",
    "timestamp": "2017-10-19T12:35:14",
    "jobId": "f4313116-b4c9-11e7-8f8c-dea7db31ad34",
    "jobDueDate": "2017-10-19T12:35:13",
    "jobRetries": 3,
    "jobPriority": 0,
    "jobExceptionMessage": null,
    "jobDefinitionId": null,
    "jobDefinitionType": "history-cleanup",
    "jobDefinitionConfiguration": null,
    "activityId": null,
    "executionId": null,
    "processInstanceId": null,
    "processDefinitionId": null,
    "processDefinitionKey": null,
    "deploymentId": null,
    "tenantId": null,
    "creationLog": false,
    "failureLog": false,
    "successLog": true,
    "deletionLog": false
  }
]

And the table row:

{
  "DEPLOYMENT_ID_": "",
  "DUEDATE_": "",
  "EXCEPTION_MSG_": "",
  "EXCEPTION_STACK_ID_": "",
  "EXCLUSIVE_": 1,
  "EXECUTION_ID_": "",
  "HANDLER_CFG_": "{\"countEmptyRuns\":0,\"immediatelyDue\":true}",
  "HANDLER_TYPE_": "history-cleanup",
  "ID_": "f4313116-b4c9-11e7-8f8c-dea7db31ad34",
  "JOB_DEF_ID_": "",
  "LOCK_EXP_TIME_": "",
  "LOCK_OWNER_": "",
  "PRIORITY_": 0,
  "PROCESS_DEF_ID_": "",
  "PROCESS_DEF_KEY_": "",
  "PROCESS_INSTANCE_ID_": "",
  "REPEAT_": "",
  "RETRIES_": 3,
  "REV_": 55,
  "SEQUENCE_COUNTER_": 26,
  "SUSPENSION_STATE_": 2,
  "TENANT_ID_": "",
  "TYPE_": "ever-living"
}

Thx.

OK, so you can see now that the job was executed twice.

You can tell whether anything was deleted either by checking the data itself or by checking the metrics (in case metrics are enabled in your configuration); see the sketch below.
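If metrics are enabled, such a check could look roughly like this via the Java API (the metric name string is an assumption based on the built-in history cleanup metrics; verify it against the Metrics class of your version):

    import org.camunda.bpm.engine.ProcessEngine;
    import org.camunda.bpm.engine.ProcessEngines;

    // Sums the number of process instances removed by history cleanup across
    // all reported intervals; the metric name is assumed and may differ.
    ProcessEngine engine = ProcessEngines.getDefaultProcessEngine();
    long removed = engine.getManagementService()
        .createMetricsQuery()
        .name("history-cleanup-removed-process-instances")
        .sum();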

Yes, I see. I will try again and see if I still get the lock timeout exceptions. Thx!

-Tobias

OK, now the task runs without any complaints. But I checked the size of the database tables, and the historic data is still huge. I also checked the disk space usage before and after I ran the cleanup for the first time; there was not much difference either. As we do not need that historic data, is it safe to truncate all ACT_HI_* tables? I will set the history level to NONE afterwards.

Thx,

-Tobias

Hi @tobimat80, in theory yes, but I never tried it :) So please back up your database before doing this.

Also, don't forget that there can be some history-related data in the ACT_GE_BYTEARRAY table.

Have you configured history time to live for your process/case/decision definitions? You could put 0 there, and then all historic data should be deleted during history cleanup (except for historic batches, job logs and similar).
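A sketch of setting the TTL programmatically, assuming a version where RepositoryService#updateProcessDefinitionHistoryTimeToLive exists (on 7.7 you can instead set the camunda:historyTimeToLive attribute in the BPMN XML and redeploy):

    import org.camunda.bpm.engine.ProcessEngine;
    import org.camunda.bpm.engine.ProcessEngines;

    // Sets the history TTL of one process definition to 0 days so that its
    // finished instances become eligible for the next cleanup run.
    ProcessEngine engine = ProcessEngines.getDefaultProcessEngine();
    String processDefinitionId = "..."; // placeholder: a real definition id
    engine.getRepositoryService()
        .updateProcessDefinitionHistoryTimeToLive(processDefinitionId, 0);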

Yes, I set that to 120 days. My problem is the ACT_HI_EXT_TASK_LOG table. It is roughly 70 GB in size (half of that is used by the indexes). Querying the table to verify that something was removed is impossible; hence I wanted to try the cleanup to make it smaller before truncating the history tables.

Thx,

-Tobias

May I know what your configuration for the batch size and parallelism of this cleanup is?
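For reference, both knobs live on the engine configuration; a minimal sketch (note that historyCleanupDegreeOfParallelism only exists from 7.9 on, so on 7.7 only the batch size applies):

    import org.camunda.bpm.engine.impl.cfg.StandaloneProcessEngineConfiguration;

    // A sketch assuming a programmatically built engine; with bpm-platform.xml
    // or a Spring setup the same value goes into the engine properties.
    StandaloneProcessEngineConfiguration config = new StandaloneProcessEngineConfiguration();
    config.setHistoryCleanupBatchSize(500); // max rows removed per cleanup transaction (default 500)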