Zeebe becomes inactive without error logs

Hi everyone,

I am currently testing a Zeebe deployment in a self-managed environment and running into an issue where Zeebe stops working at some point without further notice. I'd like to know what could potentially cause a failure of this type and how I can proceed with debugging my setup.

Setup

Since the setup is only for testing in a development environment, Zeebe is deployed as a single-broker cluster whose container also runs the embedded gateway. No other Camunda 8 components are used (the setup should stay minimal at this point).

Zeebe runs as a single container on AWS ECS (Fargate), and its data is persisted on an AWS EFS file system.

Additional details about my configuration:
Image: camunda/zeebe:8.4.0
Persisted file system path: /usr/local/zeebe/data
Environment variables:

ZEEBE_BROKER_GATEWAY_ENABLE                    = "true"
ZEEBE_BROKER_GATEWAY_NETWORK_HOST              = "0.0.0.0"
ZEEBE_BROKER_GATEWAY_NETWORK_PORT              = "26500"
ZEEBE_BROKER_GATEWAY_THREADS_MANAGEMENTTHREADS = "1"
ZEEBE_BROKER_GATEWAY_SECURITY_ENABLED          = "false"
ZEEBE_BROKER_GATEWAY_LONGPOLLING_ENABLED       = "true"
ZEEBE_BROKER_NETWORK_HOST                      = "0.0.0.0"
ZEEBE_BROKER_NETWORK_SECURITY_ENABLED          = "false"
ZEEBE_BROKER_NETWORK_COMMANDAPI_PORT           = "26501"
ZEEBE_BROKER_NETWORK_INTERNALAPI_PORT          = "26502"
ZEEBE_BROKER_DATA_DIRECTORY                    = "data"
ZEEBE_BROKER_DATA_LOGSEGMENTSIZE               = "128MB"
ZEEBE_BROKER_DATA_SNAPSHOTPERIOD               = "15m"
ZEEBE_BROKER_DATA_DISK_FREESPACE_PROCESSING    = "2GB"
ZEEBE_BROKER_DATA_DISK_FREESPACE_REPLICATION   = "1GB"
ZEEBE_BROKER_DATA_DISK_MONITORINGINTERVAL      = "1s"
ZEEBE_BROKER_DATA_BACKUP_STORE                 = "NONE"
ZEEBE_BROKER_CLUSTER_NODEID                    = "0"
ZEEBE_BROKER_CLUSTER_PARTITIONSCOUNT           = "1"
ZEEBE_BROKER_CLUSTER_REPLICATIONFACTOR         = "1"
ZEEBE_BROKER_CLUSTER_CLUSTERSIZE               = "1"
ZEEBE_BROKER_CLUSTER_CLUSTERNAME               = "***-zeebe-cluster"
ZEEBE_BROKER_CLUSTER_MESSAGECOMPRESSION        = "NONE"
ZEEBE_BROKER_THREADS_CPUTHREADCOUNT            = "1"
ZEEBE_BROKER_THREADS_IOTHREADCOUNT             = "1"
ZEEBE_LOG_APPENDER = "Stackdriver"

Usage scenario

My worker application deploys BPMN definitions for various processes to Zeebe. One of them has a timer start event and acts as a scheduler that runs every 30 minutes, so Zeebe should be executing process instances continuously.
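For completeness, the deployment itself is done with the standard Zeebe Java client. A minimal sketch of what the worker does on startup (the BPMN file name is a placeholder, not my real process; the gateway address and plaintext setting match the configuration above):

import io.camunda.zeebe.client.ZeebeClient;

public class DeployScheduler {
  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("zeebe.my-namespace:26500") // embedded gateway, port 26500 as configured
        .usePlaintext()                             // security is disabled in this setup
        .build()) {

      // Deploy the BPMN with the 30-minute timer start event; after this,
      // Zeebe is expected to create a new instance on its own every 30 minutes.
      client.newDeployResourceCommand()
          .addResourceFromClasspath("scheduler.bpmn")
          .send()
          .join();
    }
  }
}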

Failure description

When I initially start the Zeebe instance, it does not log any errors and executes the BPMN process reliably. However, after a certain amount of time (ranging from a day to a week of running), Zeebe no longer executes the BPMN process every 30 minutes; instead it simply stops creating log entries or doing anything at all.

An example of the last log entry written before Zeebe entered this state:

{
    "severity": "INFO",
    "logging.googleapis.com/sourceLocation": {
        "function": "persistNewSnapshot",
        "file": "FileBasedSnapshotStore.java",
        "line": 567
    },
    "message": "Committed new snapshot 3796747-6-3831968-9223372036854775807",
    "serviceContext": {
        "service": "zeebe",
        "version": "development"
    },
    "context": {
        "threadId": 18,
        "partitionId": "1",
        "actor-scheduler": "Broker-0",
        "threadPriority": 5,
        "loggerName": "io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore",
        "threadName": "zb-actors-0",
        "actor-name": "SnapshotStore-1"
    },
    "timestampSeconds": 1705717444,
    "timestampNanos": 775348025
}

From my understanding, everything seems to be working correctly here. However, no processes were executed after this point in time.

On the client side, activating jobs of various types fails after the broker stops working:

{
    "timestamp": "2024-01-20T02:27:03.739Z",
    "loggerName": "io.camunda.zeebe.client.job.poller",
    "level": "WARN",
    "message": "Failed to activate jobs for worker default and job type my-job-type",
    "threadName": "grpc-default-executor-428",
    "errorType": "io.grpc.StatusRuntimeException",
    "errorMessage": "DEADLINE_EXCEEDED: deadline exceeded after 19.999997415s. Name resolution delay 0.000000000 seconds. [closed=[], open=[[remote_addr=zeebe.my-namespace/10.0.1.14:26500]]]",
    "errorStackTrace": "io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 19.999997415s. Name resolution delay 0.000000000 seconds. [closed=[], open=[[remote_addr=zeebe.my-namespace/10.0.1.14:26500]]]\n\tat io.grpc.Status.asRuntimeException(Status.java:537)\n\tat io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:481)\n\tat io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:574)\n\tat io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:72)\n\tat io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:742)\n\tat io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723)\n\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\n\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\n"
}
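When the broker is in this state, it might also be worth checking whether the gateway answers anything at all. Here is a minimal sketch using the same Java client (the address and timeout are just my assumptions for illustration):

import java.time.Duration;

import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.Topology;

public class GatewayCheck {
  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("zeebe.my-namespace:26500")
        .usePlaintext()
        .build()) {

      // If the broker has silently stopped, I would expect this request to fail
      // with the same DEADLINE_EXCEEDED as the job polling above.
      Topology topology = client.newTopologyRequest()
          .requestTimeout(Duration.ofSeconds(10))
          .send()
          .join();

      topology.getBrokers().forEach(broker ->
          System.out.println(broker.getAddress() + " partitions: " + broker.getPartitions().size()));
    }
  }
}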

After I replaced the container some time later, Zeebe complained about a corrupted journal (a record's checksum does not match the checksum stored in the metadata). This happened immediately after startup:

{
    "severity": "ERROR",
    "logging.googleapis.com/sourceLocation": {
        "function": "error",
        "file": "ContextualLogger.java",
        "line": 406
    },
    "message": "RaftServer{raft-partition-partition-1} - An uncaught exception occurred, transition to inactive role",
    "serviceContext": {
        "service": "zeebe",
        "version": "development"
    },
    "context": {
        "threadId": 36,
        "partitionId": "1",
        "actor-scheduler": "Broker-0",
        "threadPriority": 5,
        "loggerName": "io.atomix.raft.impl.RaftContext",
        "threadName": "raft-server-0-1",
        "actor-name": "raft-server-1"
    },
    "@type": "type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent",
    "exception": "io.camunda.zeebe.journal.CorruptedJournalException: Record's checksum (2512628316) doesn't match checksum stored in metadata (3108830540).\n\tat io.camunda.zeebe.journal.record.JournalRecordReaderUtil.read(JournalRecordReaderUtil.java:66) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.camunda.zeebe.journal.file.SegmentReader.next(SegmentReader.java:67) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.camunda.zeebe.journal.file.SegmentedJournalReader.unsafeNext(SegmentedJournalReader.java:73) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.camunda.zeebe.journal.file.SegmentedJournalReader.next(SegmentedJournalReader.java:62) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.camunda.zeebe.journal.file.SegmentedJournalReader.seekToAsqn(SegmentedJournalReader.java:138) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.camunda.zeebe.journal.file.SegmentedJournalReader.seekToAsqn(SegmentedJournalReader.java:115) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.storage.log.RaftLogUncommittedReader.seekToAsqn(RaftLogUncommittedReader.java:72) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.roles.LeaderRole.findLastZeebeEntry(LeaderRole.java:311) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.roles.LeaderRole.start(LeaderRole.java:98) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.impl.RaftContext.transition(RaftContext.java:796) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.roles.CandidateRole.start(CandidateRole.java:53) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.impl.RaftContext.transition(RaftContext.java:796) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.roles.FollowerRole.start(FollowerRole.java:65) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.impl.RaftContext.transition(RaftContext.java:796) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.impl.RaftContext.transition(RaftContext.java:893) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.cluster.impl.RaftClusterContext.lambda$bootstrap$0(RaftClusterContext.java:85) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.utils.concurrent.SingleThreadContext$WrappedRunnable.run(SingleThreadContext.java:178) ~[zeebe-atomix-utils-8.4.0.jar:8.4.0]\n\tat java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]\n\tat java.base/java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]\n\tat java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?]\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]\n\tat java.base/java.lang.Thread.run(Unknown Source) [?:?]\n",
    "timestampSeconds": 1705928983,
    "timestampNanos": 438521227
}

This error did not always occur when I restarted Zeebe after it became inactive, but it might help with debugging.

Questions:

  • What could cause Zeebe to stop working without further notice, as in my scenario?
  • Are there any further debugging steps I can take to investigate the issue?

Any help would be much appreciated.

Thanks in advance, Janek

A similar issue is addressed in this thread.

Thank you for linking the issue. While the journal corruption error is similar to what happened in my instance, the crash-loop behaviour described there is not what happened here. I think the corrupted journal might be a result of the forced restart I performed after Zeebe became inactive; since no error message was logged before that, I assume it is not what caused Zeebe to stop working (without exiting) in the first place.

I might be able to resolve the issue for now by clearing the file system (i.e. wiping /usr/local/zeebe/data on the EFS volume) as suggested in the ticket, but based on my previous attempts, it won't take very long until Zeebe stops working again.

Can anyone else give additional input on this type of error?