Zeebe Backpressure

Hi all, we recently adopted the Camunda 8.2 Self-Managed solution on Kubernetes using the Helm chart provided in the Camunda documentation, and we have been facing a RESOURCE_EXHAUSTED error:

```
2023-12-21T08:28:55.441294942Z stderr F 08:28:55.440 | zeebe |  [io.camunda.zeebe:userTask] ERROR: Grpc Stream Error: 8 RESOURCE_EXHAUSTED: Expected to activate jobs of type 'io.camunda.zeebe:userTask', but no jobs available and at least one broker returned 'RESOURCE_EXHAUSTED'. Please try again later.
```
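For context, the error above is logged by our Node.js job worker. My understanding is that RESOURCE_EXHAUSTED means the broker is applying backpressure and rejecting ActivateJobs requests, so the client is expected to back off and retry. Below is a minimal sketch of the client-side retry knobs, assuming a zeebe-node client (which the log format suggests); the gateway address, job type, and option values are placeholders, not our exact setup:

```typescript
import { ZBClient } from 'zeebe-node'

// Minimal sketch, not our exact setup. Per the zeebe-node docs, the client
// retries gRPC calls that fail with RESOURCE_EXHAUSTED; these options tune
// that behaviour (values here are illustrative only).
const zbc = new ZBClient('my-zeebe-gateway:26500', { // placeholder address
  retry: true,           // keep retrying backpressured calls
  maxRetries: 50,        // give up after this many attempts
  maxRetryTimeout: 5000, // cap the backoff between attempts (ms, assumed)
})

zbc.createWorker({
  taskType: 'jobName', // placeholder job type
  taskHandler: async (job) => job.complete(),
})
```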

Configuration:
- 3 brokers
- 3 partitions
- replication factor of 3
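For reference, the cluster topology (brokers, partitions, and per-partition role and health) can be inspected from the client. A minimal sketch, again assuming zeebe-node; the field names follow its TopologyResponse as I understand it, so treat them as assumptions:

```typescript
import { ZBClient } from 'zeebe-node'

const zbc = new ZBClient('my-zeebe-gateway:26500') // placeholder address

// Print per-partition role and health for each broker; field names per
// zeebe-node's TopologyResponse as I understand it (treat as assumptions).
async function printTopology() {
  const topology = await zbc.topology()
  for (const broker of topology.brokers) {
    for (const partition of broker.partitions) {
      console.log(
        `broker ${broker.nodeId}, partition ${partition.partitionId}: ` +
          `${partition.role} (${partition.health})`
      )
    }
  }
}

printTopology().catch(console.error)
```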

**How did we try to resolve it?**
We increased the allocated memory and restarted the brokers. After this, two brokers were ready, but the third one started returning a 503 error (though the pods are not being restarted).
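If I understand correctly, that 503 most likely comes from the broker's readiness endpoint on the monitoring port (9600 by default), which keeps returning 503 until all of the broker's partitions have started; that would match the partition replay visible in the logs below. A small sketch we could use to watch those endpoints directly, assuming Node 18+ (global fetch) and a placeholder host for the third broker pod:

```typescript
// Sketch: poll one broker's readiness/health endpoints on the monitoring
// port (9600 by default). /ready returns 204 when the broker is ready and
// 503 while partitions are still starting. Host name is a placeholder.
const base = 'http://camunda-zeebe-2.camunda-zeebe:9600'

async function checkBroker() {
  for (const path of ['/ready', '/health']) {
    const res = await fetch(base + path)
    console.log(`${path}: HTTP ${res.status}`)
  }
}

checkBroker().catch(console.error)
```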

**Affected pod logs**

```
2023-12-21 13:05:26.501 [] [Thread-14] INFO
      io.atomix.raft.partition.impl.RaftPartitionServer - RaftPartitionServer{raft-partition-partition-2} - Starting server for partition PartitionId{id=2, group=raft-partition}
2023-12-21 13:05:26.500 [] [raft-server-0-raft-partition-partition-3] INFO
      io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-3} - Transitioning to FOLLOWER
2023-12-21 13:05:26.507 [] [raft-server-0-raft-partition-partition-3] INFO
      io.atomix.raft.impl.DefaultRaftServer - RaftServer{raft-partition-partition-3} - Server join completed. Waiting for the server to be READY
2023-12-21 13:05:27.976 [] [raft-server-0-raft-partition-partition-3] INFO
      io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-3} - Found leader 2
2023-12-21 13:05:27.981 [] [raft-server-0-raft-partition-partition-3] INFO
      io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-3} - Setting firstCommitIndex to 67793541. RaftServer is ready only after it has committed events upto this index
2023-12-21 13:05:27.982 [] [raft-server-0-raft-partition-partition-3] INFO
      io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-3} - Commit index is 67793325. RaftServer is ready only after it has committed events up to index 67793541
2023-12-21 13:05:33.062 [] [raft-server-0-raft-partition-partition-2] INFO
      io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-2} - Transitioning to FOLLOWER
2023-12-21 13:05:33.062 [] [Thread-14] INFO
      io.atomix.raft.partition.impl.RaftPartitionServer - RaftPartitionServer{raft-partition-partition-1} - Starting server for partition PartitionId{id=1, group=raft-partition}
2023-12-21 13:05:33.063 [] [raft-server-0-raft-partition-partition-2] INFO
      io.atomix.raft.impl.DefaultRaftServer - RaftServer{raft-partition-partition-2} - Server join completed. Waiting for the server to be READY
2023-12-21 13:05:33.198 [] [raft-server-0-raft-partition-partition-2] INFO
      io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-2} - Found leader 1
2023-12-21 13:05:33.249 [] [raft-server-0-raft-partition-partition-3] INFO
      io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-3} - Commit index is 67793553. RaftServer is ready
2023-12-21 13:05:33.250 [] [raft-server-0-raft-partition-partition-3] INFO
      io.atomix.raft.partition.impl.RaftPartitionServer - RaftPartitionServer{raft-partition-partition-3} - Successfully started server for partition PartitionId{id=3, group=raft-partition} in 23865ms
2023-12-21 13:05:37.074 [] [raft-server-0-raft-partition-partition-2] INFO
      io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-2} - Setting firstCommitIndex to 68024967. RaftServer is ready only after it has committed events upto this index
2023-12-21 13:05:37.074 [] [raft-server-0-raft-partition-partition-2] INFO
      io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-2} - Commit index is 67962333. RaftServer is ready only after it has committed events up to index 68024967
2023-12-21 13:05:37.155 [] [raft-server-0-raft-partition-partition-2] INFO
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-2}{role=FOLLOWER} - Started receiving new snapshot FileBasedReceivedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/2/pending/68023836-65413-68072070-68072080-1, snapshotStore=Broker-0-SnapshotStore-2, metadata=FileBasedSnapshotId{index=68023836, term=65413, processedPosition=68072070, exporterPosition=68072080}} from 1
2023-12-21 13:05:42.231 [] [raft-server-0-raft-partition-partition-1] INFO
      io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-1} - Transitioning to FOLLOWER
2023-12-21 13:05:42.231 [] [raft-server-0-raft-partition-partition-1] INFO
      io.atomix.raft.impl.DefaultRaftServer - RaftServer{raft-partition-partition-1} - Server join completed. Waiting for the server to be READY
2023-12-21 13:05:42.289 [] [raft-server-0-raft-partition-partition-1] INFO
      io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-1} - Found leader 1
2023-12-21 13:05:49.820 [] [raft-server-0-raft-partition-partition-1] INFO
      io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-1} - Setting firstCommitIndex to 68238847. RaftServer is ready only after it has committed events upto this index
2023-12-21 13:05:49.821 [] [raft-server-0-raft-partition-partition-1] INFO
      io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-1} - Commit index is 68127423. RaftServer is ready only after it has committed events up to index 68238847
2023-12-21 13:05:49.852 [] [raft-server-0-raft-partition-partition-2] INFO
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-2}{role=FOLLOWER} - Rolling back snapshot FileBasedReceivedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/2/pending/68023836-65413-68072070-68072080-1, snapshotStore=Broker-0-SnapshotStore-2, metadata=FileBasedSnapshotId{index=68023836, term=65413, processedPosition=68072070, exporterPosition=68072080}}
2023-12-21 13:05:49.853 [] [raft-server-0-raft-partition-partition-2] INFO
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-2}{role=FOLLOWER} - Started receiving new snapshot FileBasedReceivedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/2/pending/68023836-65413-68072070-68072080-2, snapshotStore=Broker-0-SnapshotStore-2, metadata=FileBasedSnapshotId{index=68023836, term=65413, processedPosition=68072070, exporterPosition=68072080}} from 1
2023-12-21 13:27:17.067 [] [raft-server-0-raft-partition-partition-2] INFO
      io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-2} - Found leader 2
2023-12-21 13:27:17.112 [] [raft-server-0-raft-partition-partition-2] INFO
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-2}{role=FOLLOWER} - Started receiving new snapshot FileBasedReceivedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/2/pending/68035306-65488-68083723-68083606-84, snapshotStore=Broker-0-SnapshotStore-2, metadata=FileBasedSnapshotId{index=68035306, term=65488, processedPosition=68083723, exporterPosition=68083606}} from 2
2023-12-21 13:27:28.722 [] [raft-server-0-raft-partition-partition-2] INFO
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-2}{role=FOLLOWER} - Rolling back snapshot FileBasedReceivedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/2/pending/68035306-65488-68083723-68083606-84, snapshotStore=Broker-0-SnapshotStore-2, metadata=FileBasedSnapshotId{index=68035306, term=65488, processedPosition=68083723, exporterPosition=68083606}}
2023-12-21 13:27:33.606 [] [raft-server-0-raft-partition-partition-1] INFO
      io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-1} - Found leader 2
2023-12-21 13:27:33.649 [] [raft-server-0-raft-partition-partition-1] INFO
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-1}{role=FOLLOWER} - Started receiving new snapshot FileBasedReceivedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/1/pending/68246678-61620-68289730-68289744-74, snapshotStore=Broker-0-SnapshotStore-1, metadata=FileBasedSnapshotId{index=68246678, term=61620, processedPosition=68289730, exporterPosition=68289744}} from 2
```

I am new to this and might have missed some information that could be of help here. Any help would be appreciated.

Thank you!

Attaching a Grafana screenshot for everyone's reference.

To follow up: we see a spike on the partitions at random intervals. I wanted to understand why this behaviour happens, as it leads to the same error:
```
ERROR: Grpc Stream Error: 8 RESOURCE_EXHAUSTED: Expected to activate jobs of type 'jobName', but no jobs available and at least one broker returned 'RESOURCE_EXHAUSTED'. Please try again later.
```

You can see the Processing Queue Size data in the attached screenshot. Only partition 1 spikes in the interval [12:55 - 12:56]. Does this happen because of the exporter?
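One thing we are considering in the meantime: since the error is raised during job activation, reducing the polling pressure from our workers might at least soften the backpressure spikes. A minimal sketch of the worker options I mean, assuming zeebe-node (option names per its docs); the values are illustrative, not recommendations:

```typescript
import { ZBClient } from 'zeebe-node'

const zbc = new ZBClient('my-zeebe-gateway:26500') // placeholder address

// Sketch: a worker that polls less aggressively. Option names follow the
// zeebe-node docs; values are illustrative only.
zbc.createWorker({
  taskType: 'jobName',  // placeholder, as in the error above
  taskHandler: async (job) => job.complete(),
  maxJobsToActivate: 8, // fewer jobs per ActivateJobs request
  longPoll: 60000,      // hold each poll open up to 60s (ms)
  pollInterval: 1000,   // wait between polls when no jobs are available (ms)
})
```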

Any help would be appreciated.

Tagging @jwulf for visibility. Thanks.