Hi everyone,
I am currently testing a Zeebe deployment in a self-managed environment and running into an issue where Zeebe stops working at some point without any notice. I'd like to know what could cause a failure of this type and how I can proceed with debugging my setup.
Setup
Since the setup is only for testing in a development environment, Zeebe is deployed as a single-broker cluster whose container also runs the embedded gateway. No other Camunda 8 components are used (the setup should stay minimal at this stage).
Zeebe runs as a single container in AWS ECS (Fargate), and its data is stored on an AWS EFS file system.
Additional details about my configuration:
Image: camunda/zeebe:8.4.0
Persisted file system path: /usr/local/zeebe/data
Environment variables:
ZEEBE_BROKER_GATEWAY_ENABLE = "true"
ZEEBE_BROKER_GATEWAY_NETWORK_HOST = "0.0.0.0"
ZEEBE_BROKER_GATEWAY_NETWORK_PORT = "26500"
ZEEBE_BROKER_GATEWAY_THREADS_MANAGEMENTTHREADS = "1"
ZEEBE_BROKER_GATEWAY_SECURITY_ENABLED = "false"
ZEEBE_BROKER_GATEWAY_LONGPOLLING_ENABLED = "true"
ZEEBE_BROKER_NETWORK_HOST = "0.0.0.0"
ZEEBE_BROKER_NETWORK_SECURITY_ENABLED = "false"
ZEEBE_BROKER_NETWORK_COMMANDAPI_PORT = "26501"
ZEEBE_BROKER_NETWORK_INTERNALAPI_PORT = "26502"
ZEEBE_BROKER_DATA_DIRECTORY = "data"
ZEEBE_BROKER_DATA_LOGSEGMENTSIZE = "128MB"
ZEEBE_BROKER_DATA_SNAPSHOTPERIOD = "15m"
ZEEBE_BROKER_DATA_DISK_FREESPACE_PROCESSING = "2GB"
ZEEBE_BROKER_DATA_DISK_FREESPACE_REPLICATION = "1GB"
ZEEBE_BROKER_DATA_DISK_MONITORINGINTERVAL = "1s"
ZEEBE_BROKER_DATA_BACKUP_STORE = "NONE"
ZEEBE_BROKER_CLUSTER_NODEID = "0"
ZEEBE_BROKER_CLUSTER_PARTITIONSCOUNT = "1"
ZEEBE_BROKER_CLUSTER_REPLICATIONFACTOR = "1"
ZEEBE_BROKER_CLUSTER_CLUSTERSIZE = "1"
ZEEBE_BROKER_CLUSTER_CLUSTERNAME = "***-zeebe-cluster"
ZEEBE_BROKER_CLUSTER_MESSAGECOMPRESSION = "NONE"
ZEEBE_BROKER_THREADS_CPUTHREADCOUNT = "1"
ZEEBE_BROKER_THREADS_IOTHREADCOUNT = "1"
ZEEBE_LOG_APPENDER = "Stackdriver"
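For reference, my worker application connects to the embedded gateway roughly like the following sketch (Zeebe Java client; the gateway address is a placeholder for the ECS service endpoint, and the connection is plaintext since security is disabled above):

import io.camunda.zeebe.client.ZeebeClient;

public final class ZeebeConnection {
  public static ZeebeClient create() {
    // Plaintext connection to the embedded gateway on port 26500;
    // "zeebe.my-namespace:26500" is a placeholder for the ECS service DNS name.
    return ZeebeClient.newClientBuilder()
        .gatewayAddress("zeebe.my-namespace:26500")
        .usePlaintext()
        .build();
  }
}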
Usage scenario
My worker application deploys BPMN definitions for various processes to Zeebe. One of them has a timer start event and acts as a scheduler that is executed every 30 minutes, so Zeebe should be executing processes continuously.
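In simplified form, the deployment and worker registration look roughly like this sketch ("scheduler.bpmn" and the trivial completion handler are placeholders, "my-job-type" matches the job type in the client logs below, and ZeebeConnection is the helper sketched above):

import io.camunda.zeebe.client.ZeebeClient;

public final class SchedulerWorker {
  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeConnection.create()) {
      // Deploy the BPMN definition containing the 30-minute timer start event.
      client.newDeployResourceCommand()
          .addResourceFromClasspath("scheduler.bpmn")
          .send()
          .join();

      // Register a worker for the service tasks of the scheduled process.
      client.newWorker()
          .jobType("my-job-type")
          .handler((jobClient, job) ->
              jobClient.newCompleteCommand(job.getKey()).send().join())
          .open();

      // Keep the worker running; the timer start event creates a new
      // process instance every 30 minutes.
      Thread.currentThread().join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}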
Failure description
When I initially start the Zeebe instance, it logs no errors and executes the BPMN process reliably. However, after a certain amount of time (ranging from a day to a week of uptime), Zeebe no longer executes the BPMN process every 30 minutes; it simply stops writing log entries or doing anything at all.
This is an example of the last log entry written before it enters this state:
{
"severity": "INFO",
"logging.googleapis.com/sourceLocation": {
"function": "persistNewSnapshot",
"file": "FileBasedSnapshotStore.java",
"line": 567
},
"message": "Committed new snapshot 3796747-6-3831968-9223372036854775807",
"serviceContext": {
"service": "zeebe",
"version": "development"
},
"context": {
"threadId": 18,
"partitionId": "1",
"actor-scheduler": "Broker-0",
"threadPriority": 5,
"loggerName": "io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore",
"threadName": "zb-actors-0",
"actor-name": "SnapshotStore-1"
},
"timestampSeconds": 1705717444,
"timestampNanos": 775348025
}
From my understanding, everything looks correct here, yet no processes were executed after this point in time.
On the client side, activating jobs of various types fails after the broker stops working:
{
"timestamp": "2024-01-20T02:27:03.739Z",
"loggerName": "io.camunda.zeebe.client.job.poller",
"level": "WARN",
"message": "Failed to activate jobs for worker default and job type my-job-type",
"threadName": "grpc-default-executor-428",
"errorType": "io.grpc.StatusRuntimeException",
"errorMessage": "DEADLINE_EXCEEDED: deadline exceeded after 19.999997415s. Name resolution delay 0.000000000 seconds. [closed=[], open=[[remote_addr=zeebe.my-namespace/10.0.1.14:26500]]]",
"errorStackTrace": "io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 19.999997415s. Name resolution delay 0.000000000 seconds. [closed=[], open=[[remote_addr=zeebe.my-namespace/10.0.1.14:26500]]]\n\tat io.grpc.Status.asRuntimeException(Status.java:537)\n\tat io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:481)\n\tat io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:574)\n\tat io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:72)\n\tat io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:742)\n\tat io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723)\n\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\n\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\n"
}
After I replaced the container some time later, Zeebe complained that the checksum of a journal record does not match the checksum stored in its metadata. This happens immediately after startup.
{
"severity": "ERROR",
"logging.googleapis.com/sourceLocation": {
"function": "error",
"file": "ContextualLogger.java",
"line": 406
},
"message": "RaftServer{raft-partition-partition-1} - An uncaught exception occurred, transition to inactive role",
"serviceContext": {
"service": "zeebe",
"version": "development"
},
"context": {
"threadId": 36,
"partitionId": "1",
"actor-scheduler": "Broker-0",
"threadPriority": 5,
"loggerName": "io.atomix.raft.impl.RaftContext",
"threadName": "raft-server-0-1",
"actor-name": "raft-server-1"
},
"@type": "type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent",
"exception": "io.camunda.zeebe.journal.CorruptedJournalException: Record's checksum (2512628316) doesn't match checksum stored in metadata (3108830540).\n\tat io.camunda.zeebe.journal.record.JournalRecordReaderUtil.read(JournalRecordReaderUtil.java:66) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.camunda.zeebe.journal.file.SegmentReader.next(SegmentReader.java:67) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.camunda.zeebe.journal.file.SegmentedJournalReader.unsafeNext(SegmentedJournalReader.java:73) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.camunda.zeebe.journal.file.SegmentedJournalReader.next(SegmentedJournalReader.java:62) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.camunda.zeebe.journal.file.SegmentedJournalReader.seekToAsqn(SegmentedJournalReader.java:138) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.camunda.zeebe.journal.file.SegmentedJournalReader.seekToAsqn(SegmentedJournalReader.java:115) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.storage.log.RaftLogUncommittedReader.seekToAsqn(RaftLogUncommittedReader.java:72) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.roles.LeaderRole.findLastZeebeEntry(LeaderRole.java:311) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.roles.LeaderRole.start(LeaderRole.java:98) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.impl.RaftContext.transition(RaftContext.java:796) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.roles.CandidateRole.start(CandidateRole.java:53) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.impl.RaftContext.transition(RaftContext.java:796) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.roles.FollowerRole.start(FollowerRole.java:65) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.impl.RaftContext.transition(RaftContext.java:796) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.impl.RaftContext.transition(RaftContext.java:893) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.cluster.impl.RaftClusterContext.lambda$bootstrap$0(RaftClusterContext.java:85) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.utils.concurrent.SingleThreadContext$WrappedRunnable.run(SingleThreadContext.java:178) ~[zeebe-atomix-utils-8.4.0.jar:8.4.0]\n\tat java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]\n\tat java.base/java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]\n\tat java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?]\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]\n\tat java.base/java.lang.Thread.run(Unknown Source) [?:?]\n",
"timestampSeconds": 1705928983,
"timestampNanos": 438521227
}
This error did not occur every time I restarted Zeebe after it had become inactive, but it might help with debugging.
Questions:
- What could cause Zeebe to stop working without any notice, as in my scenario?
- Are there any further debugging steps I can take to investigate the issue (for example, something along the lines of the topology check sketched below)?
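As a possible next step on my side, I could query the cluster topology and partition health from the Java client the next time the broker appears stuck, roughly like this sketch (again using the hypothetical ZeebeConnection helper from above):

import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.Topology;

public final class BrokerProbe {
  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeConnection.create()) {
      // Ask the gateway for the cluster topology, including the role and
      // health it reports for each partition.
      Topology topology = client.newTopologyRequest().send().join();
      topology.getBrokers().forEach(broker ->
          broker.getPartitions().forEach(partition ->
              System.out.printf("broker %d, partition %d: %s / %s%n",
                  broker.getNodeId(),
                  partition.getPartitionId(),
                  partition.getRole(),
                  partition.getHealth())));
    }
  }
}

I am not sure whether this request would still be answered while the broker is in the stuck state, though.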
Any help would be much appreciated.
Thanks in advance, Janek