Zeebe becomes inactive without error logs

Hi everyone,

I am currently testing a Zeebe deployment in a self-managed environment and running into an issue where Zeebe stops working at some point without any further notice. I'd like to know what could potentially cause a failure of this kind and how I can proceed with debugging my setup.

Setup

Since the setup is only for testing in a development environment, Zeebe is deployed as a single broker whose container also runs the embedded gateway. No other components of Camunda 8 are being used (the setup should stay minimal at this point).

Zeebe runs as a single container in AWS ECS (Fargate), and its data is persisted on an AWS EFS file system.

Additional details about my configuration:
Image: camunda/zeebe:8.4.0
Persisted file system path: /usr/local/zeebe/data
Environment variables:

ZEEBE_BROKER_GATEWAY_ENABLE                    = "true"
ZEEBE_BROKER_GATEWAY_NETWORK_HOST              = "0.0.0.0"
ZEEBE_BROKER_GATEWAY_NETWORK_PORT              = "26500"
ZEEBE_BROKER_GATEWAY_THREADS_MANAGEMENTTHREADS = "1"
ZEEBE_BROKER_GATEWAY_SECURITY_ENABLED          = "false"
ZEEBE_BROKER_GATEWAY_LONGPOLLING_ENABLED       = "true"
ZEEBE_BROKER_NETWORK_HOST                      = "0.0.0.0"
ZEEBE_BROKER_NETWORK_SECURITY_ENABLED          = "false"
ZEEBE_BROKER_NETWORK_COMMANDAPI_PORT           = "26501"
ZEEBE_BROKER_NETWORK_INTERNALAPI_PORT          = "26502"
ZEEBE_BROKER_DATA_DIRECTORY                    = "data"
ZEEBE_BROKER_DATA_LOGSEGMENTSIZE               = "128MB"
ZEEBE_BROKER_DATA_SNAPSHOTPERIOD               = "15m"
ZEEBE_BROKER_DATA_DISK_FREESPACE_PROCESSING    = "2GB"
ZEEBE_BROKER_DATA_DISK_FREESPACE_REPLICATION   = "1GB"
ZEEBE_BROKER_DATA_DISK_MONITORINGINTERVAL      = "1s"
ZEEBE_BROKER_DATA_BACKUP_STORE                 = "NONE"
ZEEBE_BROKER_CLUSTER_NODEID                    = "0"
ZEEBE_BROKER_CLUSTER_PARTITIONSCOUNT           = "1"
ZEEBE_BROKER_CLUSTER_REPLICATIONFACTOR         = "1"
ZEEBE_BROKER_CLUSTER_CLUSTERSIZE               = "1"
ZEEBE_BROKER_CLUSTER_CLUSTERNAME               = "***-zeebe-cluster"
ZEEBE_BROKER_CLUSTER_MESSAGECOMPRESSION        = "NONE"
ZEEBE_BROKER_THREADS_CPUTHREADCOUNT            = "1"
ZEEBE_BROKER_THREADS_IOTHREADCOUNT             = "1"
ZEEBE_LOG_APPENDER                             = "Stackdriver"
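
For reference, this is roughly how a client connects to this embedded gateway (a minimal sketch using the Zeebe Java client; the class and host name are illustrative, and plaintext matches the disabled security above):

import io.camunda.zeebe.client.ZeebeClient;

public final class ZeebeClientFactory {

    public static ZeebeClient create() {
        return ZeebeClient.newClientBuilder()
            // Embedded gateway of the single broker (port 26500 from the config above).
            .gatewayAddress("zeebe.my-namespace:26500")
            // ZEEBE_BROKER_GATEWAY_SECURITY_ENABLED = "false", so no TLS.
            .usePlaintext()
            .build();
    }
}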

Usage scenario

My worker application deploys BPMN definitions for various processes to Zeebe. One of them has a timer start event and acts as a scheduler that runs every 30 minutes, so Zeebe should be executing processes continuously.
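
The deployment and worker side look roughly like this (again a sketch using the Java client; the resource name is illustrative, while the job type matches the one in the client logs below):

import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;

public final class SchedulerWorker {

    public static void run(ZeebeClient client) {
        // Deploy the BPMN definition that contains the 30-minute timer start event.
        client.newDeployResourceCommand()
            .addResourceFromClasspath("scheduler.bpmn") // illustrative file name
            .send()
            .join();

        // Open a worker for the service task of the scheduled process.
        client.newWorker()
            .jobType("my-job-type")
            .handler((jobClient, job) ->
                jobClient.newCompleteCommand(job.getKey()).send().join())
            .timeout(Duration.ofSeconds(30))
            .open();
    }
}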

Failure description

When I initially start the Zeebe instance, it logs no errors and executes the BPMN process reliably. However, after a certain amount of time (anywhere from a day to a week of running), Zeebe no longer executes the BPMN process every 30 minutes; instead, it simply stops creating log entries or doing anything at all.

The last log entry created before Zeebe entered this state looks like this:

{
    "severity": "INFO",
    "logging.googleapis.com/sourceLocation": {
        "function": "persistNewSnapshot",
        "file": "FileBasedSnapshotStore.java",
        "line": 567
    },
    "message": "Committed new snapshot 3796747-6-3831968-9223372036854775807",
    "serviceContext": {
        "service": "zeebe",
        "version": "development"
    },
    "context": {
        "threadId": 18,
        "partitionId": "1",
        "actor-scheduler": "Broker-0",
        "threadPriority": 5,
        "loggerName": "io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore",
        "threadName": "zb-actors-0",
        "actor-name": "SnapshotStore-1"
    },
    "timestampSeconds": 1705717444,
    "timestampNanos": 775348025
}

From my understanding, everything seems to work correctly here. However, no processes were executed after this point.
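
When this happens again, one thing I can check is whether the broker still responds on its monitoring port (9600 by default, which should expose /health and /ready). A rough sketch with the JDK HTTP client, host name illustrative:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public final class BrokerHealthProbe {

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();
        for (String endpoint : new String[] {"/health", "/ready"}) {
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://zeebe.my-namespace:9600" + endpoint))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
            // A 2xx status means the broker considers itself healthy/ready;
            // a timeout or 503 would point to the broker being stuck.
            HttpResponse<String> response =
                http.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(endpoint + " -> " + response.statusCode());
        }
    }
}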

On the client side, activating jobs of various types fails after the broker stops working:

{
    "timestamp": "2024-01-20T02:27:03.739Z",
    "loggerName": "io.camunda.zeebe.client.job.poller",
    "level": "WARN",
    "message": "Failed to activate jobs for worker default and job type my-job-type",
    "threadName": "grpc-default-executor-428",
    "errorType": "io.grpc.StatusRuntimeException",
    "errorMessage": "DEADLINE_EXCEEDED: deadline exceeded after 19.999997415s. Name resolution delay 0.000000000 seconds. [closed=[], open=[[remote_addr=zeebe.my-namespace/10.0.1.14:26500]]]",
    "errorStackTrace": "io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 19.999997415s. Name resolution delay 0.000000000 seconds. [closed=[], open=[[remote_addr=zeebe.my-namespace/10.0.1.14:26500]]]\n\tat io.grpc.Status.asRuntimeException(Status.java:537)\n\tat io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:481)\n\tat io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:574)\n\tat io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:72)\n\tat io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:742)\n\tat io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723)\n\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\n\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\n"
}
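
To tell a stuck broker apart from a pure networking problem, I can also send a topology request with a short timeout from the client side (a sketch with the Java client; same illustrative address as above):

import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.Topology;
import java.time.Duration;

public final class TopologyProbe {

    public static void main(String[] args) {
        try (ZeebeClient client = ZeebeClient.newClientBuilder()
                .gatewayAddress("zeebe.my-namespace:26500")
                .usePlaintext()
                .build()) {
            // Fails fast with DEADLINE_EXCEEDED if the gateway does not answer;
            // otherwise prints the known brokers and their partitions.
            Topology topology = client.newTopologyRequest()
                .requestTimeout(Duration.ofSeconds(5))
                .send()
                .join();
            topology.getBrokers().forEach(broker ->
                System.out.println(broker.getAddress() + " " + broker.getPartitions()));
        }
    }
}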

After I replaced the container some time later, Zeebe complained that the checksum of a file (the journal?) was incorrect. This happened immediately after startup.

{
    "severity": "ERROR",
    "logging.googleapis.com/sourceLocation": {
        "function": "error",
        "file": "ContextualLogger.java",
        "line": 406
    },
    "message": "RaftServer{raft-partition-partition-1} - An uncaught exception occurred, transition to inactive role",
    "serviceContext": {
        "service": "zeebe",
        "version": "development"
    },
    "context": {
        "threadId": 36,
        "partitionId": "1",
        "actor-scheduler": "Broker-0",
        "threadPriority": 5,
        "loggerName": "io.atomix.raft.impl.RaftContext",
        "threadName": "raft-server-0-1",
        "actor-name": "raft-server-1"
    },
    "@type": "type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent",
    "exception": "io.camunda.zeebe.journal.CorruptedJournalException: Record's checksum (2512628316) doesn't match checksum stored in metadata (3108830540).\n\tat io.camunda.zeebe.journal.record.JournalRecordReaderUtil.read(JournalRecordReaderUtil.java:66) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.camunda.zeebe.journal.file.SegmentReader.next(SegmentReader.java:67) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.camunda.zeebe.journal.file.SegmentedJournalReader.unsafeNext(SegmentedJournalReader.java:73) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.camunda.zeebe.journal.file.SegmentedJournalReader.next(SegmentedJournalReader.java:62) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.camunda.zeebe.journal.file.SegmentedJournalReader.seekToAsqn(SegmentedJournalReader.java:138) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.camunda.zeebe.journal.file.SegmentedJournalReader.seekToAsqn(SegmentedJournalReader.java:115) ~[zeebe-journal-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.storage.log.RaftLogUncommittedReader.seekToAsqn(RaftLogUncommittedReader.java:72) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.roles.LeaderRole.findLastZeebeEntry(LeaderRole.java:311) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.roles.LeaderRole.start(LeaderRole.java:98) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.impl.RaftContext.transition(RaftContext.java:796) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.roles.CandidateRole.start(CandidateRole.java:53) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.impl.RaftContext.transition(RaftContext.java:796) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.roles.FollowerRole.start(FollowerRole.java:65) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.impl.RaftContext.transition(RaftContext.java:796) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.impl.RaftContext.transition(RaftContext.java:893) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.raft.cluster.impl.RaftClusterContext.lambda$bootstrap$0(RaftClusterContext.java:85) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]\n\tat io.atomix.utils.concurrent.SingleThreadContext$WrappedRunnable.run(SingleThreadContext.java:178) ~[zeebe-atomix-utils-8.4.0.jar:8.4.0]\n\tat java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]\n\tat java.base/java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]\n\tat java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?]\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]\n\tat java.base/java.lang.Thread.run(Unknown Source) [?:?]\n",
    "timestampSeconds": 1705928983,
    "timestampNanos": 438521227
}

This error did not always occur when I restarted Zeebe after it became inactive, but it might help with debugging.

Questions:

  • What could cause Zeebe to stop working without further notice, as in my scenario?
  • Are there any further debugging steps I can take to investigate the issue?

Any help would be greatly appreciated.

Thanks in advance, Janek

A similar issue is addressed in this thread.

Thank you for linking the issue. While the journal corruption error is similar to what happened in my instance, the crash loop described there is not what I observed. I think the corrupted journal might be a result of the forced restart I performed after Zeebe became inactive, but since no error message was displayed, I assume it was not what caused Zeebe to stop working (without quitting) in the first place.

I might be able to resolve the issue for now by clearing the filesystem as suggested in the ticket, but based on my previous attempts, it won't take very long until Zeebe stops working again.

Is there anyone else who can give additional input on this type of error?

Hi,
Did you ever find a solution to this problem? I'm running into a similar problem running Zeebe on ECS and getting a CorruptedJournalException.

Hey Ian,

I did find out the reason for the issues I was experiencing. In early February, Camunda updated the documentation on self-managed deployments with a notice stating that NFS storage is not supported by Zeebe. A GitHub issue regarding this topic was also opened: Support NFS · Issue #16686 · camunda/camunda · GitHub

Since EFS, which is NFS-based, is (to my knowledge) the only persistent storage option for ECS Fargate, it's not possible to run Zeebe with it at the moment. The primary options I can see are:

  1. Use ECS with EC2 nodes instead of Fargate to be able to make use of EBS volumes.
  2. Migrate to EKS with EC2 nodes and use EBS volumes as Kubernetes PVs.

I decided to go with EKS (partially due to features like the EBS CSI driver and the cluster autoscaler). The deployment has only been up and running for 1.5 weeks, so I can't guarantee anything, but the storage issues have disappeared so far.

Janek, thanks very much for your reply.
I have since migrated our Fargate ECS instance onto an EC2-backed instance using EBS. I am not very familiar with k8s/EKS, so that seemed like the best option for us.
It was also important to modify the autoscaling of the ECS task to make sure that only a single task is running at any one time; otherwise, when there were briefly multiple instances running against the same filesystem, it also resulted in a corrupted journal exception. This was something we hadn't noticed when using EFS.
Anyway, it all seems to be working OK now. Thanks again.