Failed to write command MESSAGE_SUBSCRIPTION DELETE from 0 to logstream

The Zeebe broker works fine, but at some point it starts filling the logs with warnings:
io.camunda.zeebe.gateway: …Failed to activate jobs…
The client isn't generating much of a workload, though. After a while, the warnings in the broker log change to: io.camunda.zeebe.broker.transport: …Failed to write command…

While all this is happening, there is heavy disk pressure on the broker storage mounts and almost no network traffic. CPU usage also climbs to its limit.
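For reference, the failing gateway call corresponds to a plain job activation from the client side, something like the sketch below with the Zeebe Java client (the gateway address, batch size, and timeout are illustrative, not our exact setup):

import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.ActivateJobsResponse;
import java.time.Duration;

public class ActivateJobsExample {
  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500") // illustrative address
        .usePlaintext()
        .build()) {
      // This is the kind of request the gateway forwards to partition 1;
      // it is the broker-side request that times out after PT15S.
      ActivateJobsResponse response = client.newActivateJobsCommand()
          .jobType("employee-module-save-interview-results-task-z")
          .maxJobsToActivate(32)          // illustrative batch size
          .timeout(Duration.ofMinutes(1)) // how long activated jobs stay locked
          .send()
          .join();
      System.out.println("Activated " + response.getJobs().size() + " jobs");
    }
  }
}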

Logs
2023-01-25 12:37:34.397 [ActivateJobsHandler] [Broker-0-zb-actors-0] WARN 
      io.camunda.zeebe.gateway - Failed to activate jobs for type employee-module-save-interview-results-task-z from partition 1
java.util.concurrent.TimeoutException: Request ProtocolRequest{id=1025445, subject=command-api-1, sender=0.0.0.0:26502, payload=byte[]{length=217, hash=-1017308190}} to 0.0.0.0:26501 timed out in PT15S
 at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$sendAndReceive$4(NettyMessagingService.java:230) ~[zeebe-atomix-cluster-8.1.5.jar:8.1.5]
 at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
 at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?]
 at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
 at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
 at java.lang.Thread.run(Unknown Source) ~[?:?]
2023-01-25 12:38:23.686 [Broker-0-SnapshotDirector-1] [Broker-0-zb-actors-0] INFO 
      io.camunda.zeebe.logstreams.snapshot - Finished taking temporary snapshot, need to wait until last written event position 10727942 is committed, current commit position is 10081276. After that snapshot will be committed.
2023-01-25 12:39:44.173 [Broker-0-InterPartitionCommandReceiverActor-1] [Broker-0-zb-actors-0] WARN 
      io.camunda.zeebe.broker.transport - Failed to write command MESSAGE_SUBSCRIPTION DELETE from 0 to logstream
2023-01-25 12:39:44.174 [Broker-0-InterPartitionCommandReceiverActor-1] [Broker-0-zb-actors-0] WARN 
      io.camunda.zeebe.broker.transport - Failed to write command MESSAGE_SUBSCRIPTION DELETE from 0 to logstream
2023-01-25 12:39:44.174 [Broker-0-InterPartitionCommandReceiverActor-1] [Broker-0-zb-actors-0] WARN 
      io.camunda.zeebe.broker.transport - Failed to write command MESSAGE_SUBSCRIPTION DELETE from 0 to logstream
2023-01-25 12:39:44.174 [Broker-0-InterPartitionCommandReceiverActor-1] [Broker-0-zb-actors-0] WARN

This is the second time in a week (after the first time, we dropped all broker data), and we are out of ideas on how to solve this.

Docker image camunda/zeebe:8.1.5

Zeebe config
zeebe:
  broker:
    stepTimeout: 5m
    gateway:
      enable: true
      network:
        host: 0.0.0.0
        port: 26500
        minKeepAliveInterval: 30s
      cluster:
        requestTimeout: 15s
      threads:
        managementThreads: 1
      monitoring:
        enabled: true
      security:
        enabled: false
    network:
      host: 0.0.0.0
      advertisedHost: 0.0.0.0
      portOffset: 0
      maxMessageSize: 4MB
      commandApi:
        host: 0.0.0.0
        port: 26501
      monitoringApi:
        host: 0.0.0.0
        port: 9600
    data:
      directories: [ data ]
      logSegmentSize: 512MB
      snapshotPeriod: 15m
    backpressure:
      algorithm: "fixed"
      fixed:
        limit: 1000
    exporters:
      elasticsearch:
        className: io.camunda.zeebe.exporter.ElasticsearchExporter
        args:
          url: http://elasticsearch-svc:9200
          bulk:
            delay: 5
            size: 1000
          index:
            prefix: zeebe-record
            createTemplate: true
            command: false
            event: true
            rejection: false
            deployment: true
            incident: true
            job: true
            message: false
            messageSubscription: false
            raft: false
            workflowInstance: true
            workflowInstanceSubscription: false

Hi @nekHi. :wave:

From your description, this seems consistent with what we have encountered; you can track this issue (When there is a big chunk of expired messages, then Zeebe fails to write the batch of corresponding commands · Issue #11480 · camunda/zeebe · GitHub) for the latest status.
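If it is the same issue, the trigger is a large backlog of buffered messages whose TTLs expire around the same time. Until a fixed version is available, one possible stopgap is to publish messages with a short (or zero) time-to-live wherever message buffering is not actually needed, so fewer messages can expire in bulk. A minimal sketch with the Java client (the message name and correlation key are just examples):

import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;

public class PublishWithShortTtl {
  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500") // illustrative address
        .usePlaintext()
        .build()) {
      client.newPublishMessageCommand()
          .messageName("interview-results-saved") // example message name
          .correlationKey("employee-42")          // example correlation key
          // With a zero TTL the message is only correlated if a matching
          // subscription is already open at publish time; nothing is
          // buffered, so nothing can expire in bulk later.
          .timeToLive(Duration.ZERO)
          .send()
          .join();
    }
  }
}

Whether a zero TTL is acceptable depends on your process design: if publishers can run ahead of the subscribing process instances, you still need some buffering, and shortening the TTL only reduces, rather than eliminates, the expired-message batches.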