Zeebe becomes inactive without error logs

Hi everyone,

I am currently testing a Zeebe deployment in a self-managed environment and running into an issue where Zeebe stops working at some point without any further notice. I'd like to know what could potentially cause a failure of this kind and how I can proceed with debugging my setup.

Setup

Since the setup is only for testing in a development environment, Zeebe is deployed as a single broker whose container also runs the embedded gateway. No other components of Camunda 8 are being used (the setup should stay minimal at this point).

Zeebe runs as a single container in AWS ECS (Fargate), and its data is persisted on an AWS EFS file system.

Additional details about my configuration:
Image: camunda/zeebe:8.4.0
Persisted file system path: /usr/local/zeebe/data
Environment variables:

ZEEBE_BROKER_GATEWAY_ENABLE                    = "true"
ZEEBE_BROKER_GATEWAY_NETWORK_HOST              = "0.0.0.0"
ZEEBE_BROKER_GATEWAY_NETWORK_PORT              = "26500"
ZEEBE_BROKER_GATEWAY_THREADS_MANAGEMENTTHREADS = "1"
ZEEBE_BROKER_GATEWAY_SECURITY_ENABLED          = "false"
ZEEBE_BROKER_GATEWAY_LONGPOLLING_ENABLED       = "true"
ZEEBE_BROKER_NETWORK_HOST                      = "0.0.0.0"
ZEEBE_BROKER_NETWORK_SECURITY_ENABLED          = "false"
ZEEBE_BROKER_NETWORK_COMMANDAPI_PORT           = "26501"
ZEEBE_BROKER_NETWORK_INTERNALAPI_PORT          = "26502"
ZEEBE_BROKER_DATA_DIRECTORY                    = "data"
ZEEBE_BROKER_DATA_LOGSEGMENTSIZE               = "128MB"
ZEEBE_BROKER_DATA_SNAPSHOTPERIOD               = "15m"
ZEEBE_BROKER_DATA_DISK_FREESPACE_PROCESSING    = "2GB"
ZEEBE_BROKER_DATA_DISK_FREESPACE_REPLICATION   = "1GB"
ZEEBE_BROKER_DATA_DISK_MONITORINGINTERVAL      = "1s"
ZEEBE_BROKER_DATA_BACKUP_STORE                 = "NONE"
ZEEBE_BROKER_CLUSTER_NODEID                    = "0"
ZEEBE_BROKER_CLUSTER_PARTITIONSCOUNT           = "1"
ZEEBE_BROKER_CLUSTER_REPLICATIONFACTOR         = "1"
ZEEBE_BROKER_CLUSTER_CLUSTERSIZE               = "1"
ZEEBE_BROKER_CLUSTER_CLUSTERNAME               = "***-zeebe-cluster"
ZEEBE_BROKER_CLUSTER_MESSAGECOMPRESSION        = "NONE"
ZEEBE_BROKER_THREADS_CPUTHREADCOUNT            = "1"
ZEEBE_BROKER_THREADS_IOTHREADCOUNT             = "1"
ZEEBE_LOG_APPENDER                             = "Stackdriver"
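
For reference, this is roughly how a client connects to this embedded gateway (a minimal sketch using the Zeebe Java client; the class and host name are illustrative, and plaintext matches the disabled security above):

import io.camunda.zeebe.client.ZeebeClient;

public final class ZeebeClientFactory {

    public static ZeebeClient create() {
        return ZeebeClient.newClientBuilder()
            // Embedded gateway of the single broker (port 26500 from the config above).
            .gatewayAddress("zeebe.my-namespace:26500")
            // ZEEBE_BROKER_GATEWAY_SECURITY_ENABLED = "false", so no TLS.
            .usePlaintext()
            .build();
    }
}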

Usage scenario

My worker application deploys BPMN definitions for various processes to Zeebe. One of them has a timer start event and acts as a scheduler that runs every 30 minutes, so Zeebe should be executing processes continuously.
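
The deployment and worker side look roughly like this (again a sketch using the Java client; the resource name is illustrative, while the job type matches the one in the client logs below):

import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;

public final class SchedulerWorker {

    public static void run(ZeebeClient client) {
        // Deploy the BPMN definition that contains the 30-minute timer start event.
        client.newDeployResourceCommand()
            .addResourceFromClasspath("scheduler.bpmn") // illustrative file name
            .send()
            .join();

        // Open a worker for the service task of the scheduled process.
        client.newWorker()
            .jobType("my-job-type")
            .handler((jobClient, job) ->
                jobClient.newCompleteCommand(job.getKey()).send().join())
            .timeout(Duration.ofSeconds(30))
            .open();
    }
}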

Failure description

When I initially start the Zeebe instance, it logs no errors and executes the BPMN process reliably. However, after a certain amount of time (anywhere from a day to a week of running), Zeebe no longer executes the BPMN process every 30 minutes; instead, it simply stops creating log entries or doing anything at all.

The last log entry created before Zeebe entered this state looks like this:

{
    "severity": "INFO",
    "logging.googleapis.com/sourceLocation": {
        "function": "persistNewSnapshot",
        "file": "FileBasedSnapshotStore.java",
        "line": 567
    },
    "message": "Committed new snapshot 3796747-6-3831968-9223372036854775807",
    "serviceContext": {
        "service": "zeebe",
        "version": "development"
    },
    "context": {
        "threadId": 18,
        "partitionId": "1",
        "actor-scheduler": "Broker-0",
        "threadPriority": 5,
        "loggerName": "io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore",
        "threadName": "zb-actors-0",
        "actor-name": "SnapshotStore-1"
    },
    "timestampSeconds": 1705717444,
    "timestampNanos": 775348025
}

From my understanding, everything seems to work correctly here. However, no processes were executed after this point.
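
When this happens again, one thing I can check is whether the broker still responds on its monitoring port (9600 by default, which should expose /health and /ready). A rough sketch with the JDK HTTP client, host name illustrative:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public final class BrokerHealthProbe {

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();
        for (String endpoint : new String[] {"/health", "/ready"}) {
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://zeebe.my-namespace:9600" + endpoint))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
            // A 2xx status means the broker considers itself healthy/ready;
            // a timeout or 503 would point to the broker being stuck.
            HttpResponse<String> response =
                http.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(endpoint + " -> " + response.statusCode());
        }
    }
}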

On the client side, activating jobs of various types fails after the broker stops working:

{
    "timestamp": "2024-01-20T02:27:03.739Z",
    "loggerName": "io.camunda.zeebe.client.job.poller",
    "level": "WARN",
    "message": "Failed to activate jobs for worker default and job type my-job-type",
    "threadName": "grpc-default-executor-428",
    "errorType": "io.grpc.StatusRuntimeException",
    "errorMessage": "DEADLINE_EXCEEDED: deadline exceeded after 19.999997415s. Name resolution delay 0.000000000 seconds. [closed=[], open=[[remote_addr=zeebe.my-namespace/10.0.1.14:26500]]]",
    "errorStackTrace": "io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 19.999997415s. Name resolution delay 0.000000000 seconds. [closed=[], open=[[remote_addr=zeebe.my-namespace/10.0.1.14:26500]]]\n\tat io.grpc.Status.asRuntimeException(Status.java:537)\n\tat io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:481)\n\tat io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:574)\n\tat io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:72)\n\tat io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:742)\n\tat io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723)\n\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\n\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\n"
}
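
To tell a stuck broker apart from a pure networking problem, I can also send a topology request with a short timeout from the client side (a sketch with the Java client; same illustrative address as above):

import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.Topology;
import java.time.Duration;

public final class TopologyProbe {

    public static void main(String[] args) {
        try (ZeebeClient client = ZeebeClient.newClientBuilder()
                .gatewayAddress("zeebe.my-namespace:26500")
                .usePlaintext()
                .build()) {
            // Fails fast with DEADLINE_EXCEEDED if the gateway does not answer;
            // otherwise prints the known brokers and their partitions.
            Topology topology = client.newTopologyRequest()
                .requestTimeout(Duration.ofSeconds(5))
                .send()
                .join();
            topology.getBrokers().forEach(broker ->
                System.out.println(broker.getAddress() + " " + broker.getPartitions()));
        }
    }
}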

After I replaced the container some time later, Zeebe complained that the checksum of a file (the journal?) was incorrect. This happened immediately after startup.

{
    "severity": "ERROR",
    "logging.googleapis.com/sourceLocation": {
        "function": "error",
        "file": "ContextualLogger.java",
        "line": 406
    },
    "message": "RaftServer{raft-partition-partition-1} - An uncaught exception occurred, transition to inactive role",
    "serviceContext": {
        "service": "zeebe",
        "version": "development"
    },
    "context": {
        "threadId": 36,
        "partitionId": "1",
        "actor-scheduler": "Broker-0",
        "threadPriority": 5,
        "loggerName": "io.atomix.raft.impl.RaftContext",
        "threadName": "raft-server-0-1",
        "actor-name": "raft-server-1"
    },
    "@type": "type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent",
    "exception": "io.camunda.zeebe.journal.CorruptedJournalException: Record's checksum (2512628316) doesn't match checksum stored in metadata (3108830540).\n\tat io.camunda.zeebe.journal.record.JournalRecordReaderUtil.read(JournalRecordReaderUtil.java:66) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.camunda.zeebe.journal.file.SegmentReader.next(SegmentReader.java:67) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.camunda.zeebe.journal.file.SegmentedJournalReader.unsafeNext(SegmentedJournalReader.java:73) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.camunda.zeebe.journal.file.SegmentedJournalReader.next(SegmentedJournalReader.java:62) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.camunda.zeebe.journal.file.SegmentedJournalReader.seekToAsqn(SegmentedJournalReader.java:138) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.camunda.zeebe.journal.file.SegmentedJournalReader.seekToAsqn(SegmentedJournalReader.java:115) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.storage.log.RaftLogUncommittedReader.seekToAsqn(RaftLogUncommittedReader.java:72) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.roles.LeaderRole.findLastZeebeEntry(LeaderRole.java:311) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.roles.LeaderRole.start(LeaderRole.java:98) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.impl.RaftContext.transition(RaftContext.java:796) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.roles.CandidateRole.start(CandidateRole.java:53) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.impl.RaftContext.transition(RaftContext.java:796) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.roles.FollowerRole.start(FollowerRole.java:65) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.impl.RaftContext.transition(RaftContext.java:796) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.impl.RaftContext.transition(RaftContext.java:893) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.cluster.impl.RaftClusterContext.lambda$bootstrap$0(RaftClusterContext.java:85) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.utils.concurrent.SingleThreadContext$WrappedRunnable.run(SingleThreadContext.java:178) ~[zeebe-atomix-utils-8.4.0.jar:8.4.0]\n\tat java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]\n\tat java.base/java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]\n\tat java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?]\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]\n\tat java.base/java.lang.Thread.run(Unknown Source) [?:?]\n",
    "timestampSeconds": 1705928983,
    "timestampNanos": 438521227
}

This error did not always occur when I restarted Zeebe after it became inactive, but it might help with debugging.

Questions:

  • What could cause Zeebe to stop working without further notice, as in my scenario?
  • Are there any further debugging steps I can take to investigate the issue?

Any help would be greatly appreciated.

Thanks in advance, Janek

A similar issue is addressed in this thread.

Thank you for linking the issue. While the journal corruption error is similar to what happened in my instance, the crash loop described there is not what I observed. I think the corrupted journal might be a result of the forced restart I performed after Zeebe became inactive, but since no error message was displayed, I assume it was not what caused Zeebe to stop working (without quitting) in the first place.

I might be able to resolve the issue for now by clearing the filesystem as suggested in the ticket, but based on my previous attempts, it won't take very long until Zeebe stops working again.

Is there anyone else who can give additional input on this type of error?

Hi,
Did you ever find a solution to this problem? I'm running into a similar problem running Zeebe on ECS and getting a CorruptedJournalException.

Hey Ian,

I did find out the reason for the issues I was experiencing. In early February, Camunda updated the documentation on self-managed deployments with a notice stating that NFS storage is not supported by Zeebe. A GitHub issue regarding this topic was also opened: Support NFS · Issue #16686 · camunda/camunda · GitHub

Since EFS, which is NFS-based, is (to my knowledge) the only persistent storage option for ECS Fargate, it's not possible to run Zeebe with it at the moment. The primary options I can see are:

  1. Use ECS with EC2 nodes instead of Fargate to be able to make use of EBS volumes.
  2. Migrate to EKS with EC2 nodes and use EBS volumes as Kubernetes PVs.

I decided to go with EKS (partially due to features like the EBS CSI driver and the cluster autoscaler). The deployment has only been up and running for 1.5 weeks, so I can't guarantee anything, but the storage issues have disappeared so far.

Janek, thanks very much for your reply.
I have since migrated our Fargate ECS instance onto an EC2-backed instance using EBS. I am not very familiar with k8s/EKS, so that seemed like the best option for us.
It was also important to modify the autoscaling of the ECS task to make sure that only a single task is running at any one time; otherwise, when there were briefly multiple instances running against the same filesystem, it also resulted in a corrupted journal exception. This was something we hadn't noticed when using EFS.
Anyway, it all seems to be working OK now. Thanks again.