Early this morning, our Camunda server (7.17.0, running in a Docker container) started to report persistence layer exceptions . . . almost CONTINUOUSLY. We’re using MySQL (MariaDB) as our database, running in its own Docker container on a separate server. I stopped the Camunda Docker container, and now when I attempt to restart it, the errors start occurring immediately; if I try to navigate to the Camunda app in my browser, I get a 500 error. I’ve tried pointing a different Camunda server (normally configured for our test environment) at the production database, and I get the same problem: a constant stream of errors. I don’t believe the problem lies in the Camunda server configuration; it appears that somehow the MySQL database has gotten into a bad state. Are there any Camunda tools or documented procedures for checking the consistency of the database tables and correcting problems?
Hello @kilsen ,
Does your MySQL database have enough disk space? Since Camunda Platform 7 saves all history data to the same database as the runtime data by default, it can happen that the database runs out of disk space.
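One quick way to check, assuming you can reach the database with the mysql client (the schema name and data-directory path below are placeholders, adjust them to your setup):

# size of the largest tables in the Camunda schema -- replace camunda_db with your schema name
mysql -e "SELECT table_name, ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
          FROM information_schema.tables
          WHERE table_schema = 'camunda_db'
          ORDER BY (data_length + index_length) DESC
          LIMIT 10;"
# free space on the volume that holds the MySQL data directory (default path shown)
df -h /var/lib/mysql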
If this is the cause, we can investigate further on how you can prevent this in the future.
Jonathan
Jonathan,
Thanks for the suggestion. Plenty of disk space, so we don’t think that’s the problem.
- Kevin
Hello @kilsen ,
Thank you for checking this. Can you post at least one of the exceptions that occurred? That would be helpful. And one question beforehand: did you update Camunda Platform 7 from a version older than 7.17.0?
Jonathan
Here’s an example of a log message that we’re getting:
06-Jun-2022 12:28:21.945 SEVERE [Thread-4] org.camunda.commons.logging.BaseLogger.logError ENGINE-16004 Exception while closing command context: An exception occurred in the persistence layer. Please check the server logs for a detailed message and the entire exception stack trace.
org.camunda.bpm.engine.ProcessEngineException: An exception occurred in the persistence layer. Please check the server logs for a detailed message and the entire exception stack trace.
at org.camunda.bpm.engine.impl.util.ExceptionUtil.wrapPersistenceException(ExceptionUtil.java:263)
at org.camunda.bpm.engine.impl.db.EnginePersistenceLogger.flushDbOperationException(EnginePersistenceLogger.java:133)
at org.camunda.bpm.engine.impl.db.entitymanager.DbEntityManager.flushDbOperations(DbEntityManager.java:364)
at org.camunda.bpm.engine.impl.db.entitymanager.DbEntityManager.flushDbOperationManager(DbEntityManager.java:323)
at org.camunda.bpm.engine.impl.db.entitymanager.DbEntityManager.flush(DbEntityManager.java:295)
at org.camunda.bpm.engine.impl.interceptor.CommandContext.flushSessions(CommandContext.java:272)
at org.camunda.bpm.engine.impl.interceptor.CommandContext.close(CommandContext.java:188)
at org.camunda.bpm.engine.impl.interceptor.CommandContextInterceptor.execute(CommandContextInterceptor.java:119)
at org.camunda.bpm.engine.impl.interceptor.ProcessApplicationContextInterceptor.execute(ProcessApplicationContextInterceptor.java:70)
at org.camunda.bpm.engine.impl.interceptor.CommandCounterInterceptor.execute(CommandCounterInterceptor.java:35)
at org.camunda.bpm.engine.impl.interceptor.LogInterceptor.execute(LogInterceptor.java:33)
at org.camunda.bpm.engine.impl.jobexecutor.SequentialJobAcquisitionRunnable.acquireJobs(SequentialJobAcquisitionRunnable.java:164)
at org.camunda.bpm.engine.impl.jobexecutor.SequentialJobAcquisitionRunnable.run(SequentialJobAcquisitionRunnable.java:80)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.camunda.bpm.engine.ProcessEngineException: ENGINE-03004 Exception while executing Database Operation 'UPDATE AcquirableJobEntity[01765789-e467-11ec-9665-0242ac1f0002]' with message '
### Error flushing statements. Cause: org.apache.ibatis.executor.BatchExecutorException: org.camunda.bpm.engine.impl.persistence.entity.JobEntity.updateAcquirableJob (batch index #1) failed. Cause: java.sql.BatchUpdateException: Communications link failure
And yes, we had upgraded from 7.10.0 to 7.17.0, back on May 22. We upgraded one step at a time, and after each schema upgrade we re-launched the corresponding version of Camunda (via Docker) and confirmed that we could log in to Cockpit and see the processes and process instances.
The errors suddenly started yesterday morning, shortly before 10:00 AM UTC.
. . . and here’s how we have the JDBC data source configured:
<Resource
  uniqueResourceName="process-engine"
  name="jdbc/ProcessEngine"
  auth="Container"
  factory="org.apache.tomcat.jdbc.pool.DataSourceFactory"
  type="javax.sql.DataSource"
  defaultTransactionIsolation="READ_COMMITTED"
  driverClassName="com.mysql.jdbc.Driver"
  username="******"
  password="*******"
  url="jdbc:mysql://*******:3306/camunda_db?autoReconnect=true"
  maxActive="50"
  maxIdle="50"
  minIdle="5"
  maxWait="10000"
  testOnBorrow="true"
  testOnReturn="false"
  testWhileIdle="true"
  validationQuery="SELECT 1"
  timeBetweenEvictionRunsMillis="30000"
  minEvictableIdleTimeMillis="30000"
  removeAbandoned="true"
  removeAbandonedTimeout="60"
/>
I’ve been tweaking those settings and restarting since the errors first started, but to no avail. The default MySQL transaction isolation level is now set to READ-COMMITTED - it had previously been set to REPEATABLE-READ, and I changed it after researching the errors and realizing it should have been READ-COMMITTED according to the Camunda docs.
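For reference, the isolation-level change was along these lines (a sketch; SET GLOBAL only affects connections opened after it runs, so Camunda has to be restarted either way):

# check the current level (works on MySQL and MariaDB)
mysql -e "SHOW VARIABLES LIKE '%isolation%';"
# change it at runtime; only new connections pick it up
mysql -e "SET GLOBAL TRANSACTION ISOLATION LEVEL READ COMMITTED;"
# to persist across server restarts: transaction-isolation = READ-COMMITTED under [mysqld] in my.cnf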
We seem to have a corrupt MySQL database. We tried restoring a file backup from a couple of days ago and were able to start up, but we received a ton of errors indicating that the log sequence number is in the future. (And yes, we restored the entire MySQL data directory, including the InnoDB log files.) So after starting up we immediately tried to run a mysqldump in order to reconstruct the database, but when it starts to dump the ACT_HI_ACTINST table, it crashes MySQL. So we’re wondering: what would be the downside of dumping the tables individually, skipping the history tables, and then dropping the history tables? Would the history tables be recreated automatically on the next start of Camunda? Do we need those history tables?
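Concretely, we’re thinking of something along these lines (a sketch only; the --ignore-table option would be repeated for every ACT_HI_* table, not just the one shown):

# dump everything except the (corrupt) history tables -- schema/table names per our setup
mysqldump --single-transaction \
  --ignore-table=camunda_db.ACT_HI_ACTINST \
  camunda_db > camunda_no_history.sql
# ...repeat --ignore-table for each remaining ACT_HI_* table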
The corruption seems to be confined to the two largest ACT_HI tables, as well as the ACT_RU_TASK table. I assume that we can safely proceed without the ACT_HI tables (in other words, without restoring what we had), but I also assume that the ACT_RU_TASK table is absolutely essential. (A glance at the DB schema diagram certainly indicates that.) Any suggestions on how to manually repair that table? It’s got 7,585 rows, but when we query beyond the first 1,000 or so, we start getting fatal MySQL errors.
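Would something along these lines be a sane way to salvage the readable rows into a fresh copy of the table (again just a sketch; the chunk size and the ID_ ordering are illustrative)?

# confirm what MySQL thinks is wrong with the table
mysql camunda_db -e "CHECK TABLE ACT_RU_TASK EXTENDED;"
# copy rows out in small chunks, stopping before the range that crashes the server
mysql camunda_db -e "CREATE TABLE ACT_RU_TASK_SALVAGE LIKE ACT_RU_TASK;"
mysql camunda_db -e "INSERT INTO ACT_RU_TASK_SALVAGE
                     SELECT * FROM ACT_RU_TASK ORDER BY ID_ LIMIT 1000;"
# ...then repeat with LIMIT 1000 OFFSET 1000, 2000, ... until the reads start failing
# if even small reads crash the server, starting mysqld with innodb_force_recovery (1-3) might let the dumps through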