Zeebe unhealthy after "Unexpected exception occurred on moving valid snapshot"

Nihal S: Hi Team,
Our Zeebe service suddenly became unhealthy, apparently due to "Unexpected exception occurred on moving valid snapshot". We saw the following errors before our partition became unhealthy.

Nithin B S:

1647777619565,"      io.zeebe.logstreams.snapshot - Finished taking snapshot, need to wait until last written event position 6187296 is committed, current commit position is 6187296. After that snapshot can be marked as valid."
1647777619566,2022-03-20 12:00:19.565 [Broker-0-SnapshotDirector-1] [Broker-0-zb-fs-workers-1] ERROR
1647777619566,      io.zeebe.logstreams.snapshot - Unexpected exception occurred on moving valid snapshot.
1647777619566,java.lang.IllegalStateException: Snapshot is not valid. It may have been deleted.
1647777619566,	at io.zeebe.snapshots.broker.impl.FileBasedTransientSnapshot.lambda$persist$3(FileBasedTransientSnapshot.java:106) ~[zeebe-snapshots-0.26.2.jar:0.26.2]
1647777619566,	at io.zeebe.util.sched.ActorControl.lambda$call$0(ActorControl.java:136) ~[zeebe-util-0.26.2.jar:0.26.2]
1647777619566,	at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:62) [zeebe-util-0.26.2.jar:0.26.2]
1647777619566,	at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-0.26.2.jar:0.26.2]
1647777619566,	at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:122) [zeebe-util-0.26.2.jar:0.26.2]
1647777619566,	at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:94) [zeebe-util-0.26.2.jar:0.26.2]
1647777619566,	at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:78) [zeebe-util-0.26.2.jar:0.26.2]
1647777619566,	at io.zeebe.util.sched.ActorThread.run(ActorThread.java:191) [zeebe-util-0.26.2.jar:0.26.2]
1647777634596,2022-03-20 12:00:34.595 [io.zeebe.gateway.impl.broker.BrokerRequestManager] [Broker-0-zb-actors-0] WARN 
1647777634596,      io.zeebe.gateway - Failed to activate jobs for type XYZ from partition 1
1647777634596,java.util.concurrent.TimeoutException: Request type command-api-1 timed out in 15000 milliseconds
1647777634596,	at io.atomix.cluster.messaging.impl.AbstractClientConnection$Callback.timeout(AbstractClientConnection.java:163) ~[atomix-cluster-0.26.2.jar:0.26.2]
1647777634596,	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
1647777634596,	at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
1647777634596,	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?]
1647777634596,	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
1647777634596,	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
1647777634596,	at java.lang.Thread.run(Unknown Source) ~[?:?]
1647777634596,2022-03-20 12:00:34.596 [io.zeebe.gateway.impl.broker.BrokerRequestManager] [Broker-0-zb-actors-0] WARN 
1647777634596,      io.zeebe.gateway - Failed to activate jobs for type ABC from partition 1
1647777634596,java.util.concurrent.TimeoutException: Request type command-api-1 timed out in 15000 milliseconds
1647777634596,	at io.atomix.cluster.messaging.impl.AbstractClientConnection$Callback.timeout(AbstractClientConnection.java:163) ~[atomix-cluster-0.26.2.jar:0.26.2]
1647777634596,	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
1647777634596,	at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
1647777634596,	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?]
1647777634596,	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
1647777634596,	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
1647777634596,	at java.lang.Thread.run(Unknown Source) ~[?:?]
1647777634596,2022-03-20 12:00:34.596 [io.zeebe.gateway.impl.broker.BrokerRequestManager] [Broker-0-zb-actors-0] WARN 
1647777634596,      io.zeebe.gateway - Failed to activate jobs for type DEF from partition 1
1647777634596,java.util.concurrent.TimeoutException: Request type command-api-1 timed out in 15000 milliseconds
1647777634596,	at io.atomix.cluster.messaging.impl.AbstractClientConnection$Callback.timeout(AbstractClientConnection.java:163) ~[atomix-cluster-0.26.2.jar:0.26.2]
1647777634596,	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
1647777634596,	at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
1647777634596,	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?]
1647777634596,	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
1647777634596,	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
1647777634596,	at java.lang.Thread.run(Unknown Source) ~[?:?]
1647777634596,2022-03-20 12:00:34.596 [io.zeebe.gateway.impl.broker.BrokerRequestManager] [Broker-0-zb-actors-0] WARN 
1647777634596,      io.zeebe.gateway - Failed to activate jobs for type GHI from partition 1
1647777634596,java.util.concurrent.TimeoutException: Request type command-api-1 timed out in 15000 milliseconds
1647777634596,	at io.atomix.cluster.messaging.impl.AbstractClientConnection$Callback.timeout(AbstractClientConnection.java:163) ~[atomix-cluster-0.26.2.jar:0.26.2]
1647777634596,	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
1647777634596,	at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
1647777634596,	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?]
1647777634596,	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
1647777634596,	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
1647777634596,	at java.lang.Thread.run(Unknown Source) ~[?:?]
1647777673271,2022-03-20 12:01:13.270 [Broker-0-HealthCheckService] [Broker-0-zb-actors-0] ERROR
1647777673271,"      io.zeebe.broker.system - Partition-1 failed, marking it as unhealthy"
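
To make the order of events easier to read, here is a minimal sketch that turns the epoch-millisecond prefixes from the excerpt above into a timeline. The three timestamps are copied from the log; the short labels and the `timeline` helper are our own, introduced only for illustration.

```python
from datetime import datetime, timezone

# Epoch-millisecond prefixes copied from the log excerpt above.
# The labels are our own shorthand, not Zeebe terminology.
EVENTS = [
    (1647777619566, "snapshot persist failed"),
    (1647777634596, "activate-jobs requests timed out"),
    (1647777673271, "partition 1 marked unhealthy"),
]


def timeline(events):
    """Return (UTC time, seconds since first event, label) for each event."""
    start_ms = events[0][0]
    rows = []
    for ts_ms, label in events:
        when = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
        offset_s = (ts_ms - start_ms) / 1000
        rows.append((when.strftime("%H:%M:%S"), offset_s, label))
    return rows


if __name__ == "__main__":
    for clock, offset_s, label in timeline(EVENTS):
        print(f"{clock}  +{offset_s:7.3f}s  {label}")
```

The timeouts are logged about 15 seconds after the snapshot error, which matches the 15000 ms request timeout, so those requests were probably sent at around the time of the snapshot error. The partition is marked unhealthy roughly 54 seconds after the snapshot error. This only shows that the events were close together in time; it does not show that one caused the other.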

Nihal S: Hey, we're using Zeebe version 0.26.2.

We understand that ideally we should upgrade Zeebe, but we can't do so right now for compliance reasons. We would appreciate any help in resolving this.

Pavan: Could this be caused by this issue? https://github.com/camunda-cloud/zeebe/issues/6377
@Josh Wulf @Deepthi

Deepthi: That exception does not look like it will lead to a problem. Most likely there was a new snapshot, so it is ok if this snapshot was not committed. Also it is not related to the issue linked.

Pavan: Oh, OK.
Initially we suspected that traffic (the number of concurrent jobs) might have caused this. But yesterday it happened when there was very little traffic.

Pavan: Also, could this somehow be caused by too many concurrent jobs?
We noticed this error as well: java.util.concurrent.TimeoutException: Request type command-api-1 timed out in 15000 milliseconds

Pavan: One other question: if a thread fails because of an exception, it should recover from it, right?

Nithin B S: We also had these errors in the logs.

Deepthi: > One other question: if a thread fails because of an exception, it should recover from it, right?
Yes. But since this is in 0.26.*, I don't remember whether there were any bugs that prevent it from recovering. We have improved error handling in snapshotting a lot and have fixed a lot of bugs in later versions.

Pavan: Oh, thanks!
I understand the ideal approach is to upgrade Zeebe to the latest version.
But because of the nature of the product, it might take a month or two just to upgrade the version (or eventually move to Camunda Cloud).
Is there any workaround we can use with the current version until then?
Logs: https://pastebin.com/d5GRMTLA

Note: This post was generated by Slack Archivist from a conversation in the Zeebe Slack, a source of valuable discussions on Zeebe (get an invite). Someone in the Slack thought this was worth sharing!
