Hi,
It looks like the broker is facing serious issues exporting towards Hazelcast, but we can't pinpoint the exact issue and root cause.
What we see is that the broker is up:
./zbctl --address 0.0.0.0:26500 --insecure status
Cluster size: 1
Partitions count: 1
Replication factor: 1
Gateway version: 0.25.3
Brokers:
  Broker 0 - 10.0.0.111:26501
    Version: 0.25.3
    Partition 1 : Leader
We also see that we can start a new workflow instance:
./zbctl --address 0.0.0.0:26500 --insecure create instance demoProcess
{
  "workflowKey": 2251799813685877,
  "bpmnProcessId": "demoProcess",
  "version": 1,
  "workflowInstanceKey": 2251799818453600
}
The broker is continuously throwing this error:
2021-02-12 11:24:47.521 [Broker-0-Exporter-1] [Broker-0-zb-fs-workers-1] ERROR io.zeebe.util.retry.EndlessRetryStrategy - Catched exception class java.lang.IllegalArgumentException with message invalid offset: -17568, will retry...
java.lang.IllegalArgumentException: invalid offset: -17568
	at org.agrona.concurrent.UnsafeBuffer.boundsCheckWrap(UnsafeBuffer.java:1702) ~[agrona-1.8.0.jar:1.8.0]
	at org.agrona.concurrent.UnsafeBuffer.wrap(UnsafeBuffer.java:256) ~[agrona-1.8.0.jar:1.8.0]
	at io.zeebe.msgpack.spec.MsgPackReader.wrap(MsgPackReader.java:49) ~[zeebe-msgpack-core-0.25.3.jar:0.25.3]
	at io.zeebe.msgpack.UnpackedObject.wrap(UnpackedObject.java:29) ~[zeebe-msgpack-value-0.25.3.jar:0.25.3]
	at io.zeebe.logstreams.impl.log.LoggedEventImpl.readValue(LoggedEventImpl.java:135) ~[zeebe-logstreams-0.25.3.jar:0.25.3]
	at io.zeebe.engine.processing.streamprocessor.RecordValues.readRecordValue(RecordValues.java:35) ~[zeebe-workflow-engine-0.25.3.jar:0.25.3]
	at io.zeebe.broker.exporter.stream.ExporterDirector$RecordExporter.wrap(ExporterDirector.java:328) ~[zeebe-broker-0.25.3.jar:0.25.3]
	at io.zeebe.broker.exporter.stream.ExporterDirector.lambda$exportEvent$6(ExporterDirector.java:253) ~[zeebe-broker-0.25.3.jar:0.25.3]
	at io.zeebe.util.retry.ActorRetryMechanism.run(ActorRetryMechanism.java:36) ~[zeebe-util-0.25.3.jar:0.25.3]
	at io.zeebe.util.retry.EndlessRetryStrategy.run(EndlessRetryStrategy.java:50) ~[zeebe-util-0.25.3.jar:0.25.3]
	at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:73) [zeebe-util-0.25.3.jar:0.25.3]
	at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-0.25.3.jar:0.25.3]
	at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:122) [zeebe-util-0.25.3.jar:0.25.3]
	at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:94) [zeebe-util-0.25.3.jar:0.25.3]
	at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:78) [zeebe-util-0.25.3.jar:0.25.3]
	at io.zeebe.util.sched.ActorThread.run(ActorThread.java:191) [zeebe-util-0.25.3.jar:0.25.3]
The Hazelcast logs are not reporting any issues:
2021-02-12 12:04:45,474 [ INFO] [hz.happy_fermi.HealthMonitor] [c.h.i.d.HealthMonitor]: [hazelcast1]:5701 [dev] [4.1] processors=4, physical.memory.total=15.6G, physical.memory.free=346.5M, swap.space.total=0, swap.space.free=0, heap.memory.used=60.9M, heap.memory.free=242.1M, heap.memory.total=303.0M, heap.memory.max=11.1G, heap.memory.used/total=20.10%, heap.memory.used/max=0.54%, minor.gc.count=5, minor.gc.time=67ms, major.gc.count=2, major.gc.time=110ms, load.process=0.00%, load.system=0.00%, load.systemAverage=4.37, thread.count=52, thread.peakCount=56, cluster.timeDiff=0, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.client.query.size=0, executor.q.client.blocking.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operations.size=0, executor.q.priorityOperation.size=0, operations.completed.count=125, executor.q.mapLoad.size=0, executor.q.mapLoadAllKeys.size=0, executor.q.cluster.size=0, executor.q.response.size=0, operations.running.count=0, operations.pending.invocations.percentage=0.00%, operations.pending.invocations.count=0, proxy.count=1, clientEndpoint.count=2, connection.active.count=2, client.connection.count=0, connection.count=0
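The broker-side error happens while the exporter reads the record from the log (LoggedEventImpl.readValue), i.e. before anything is handed to Hazelcast, so we would also like to confirm from the Hazelcast side whether records are still arriving at all. Below is a minimal sketch with a plain Hazelcast 4.1 client; the member address and the ringbuffer name "zeebe" are assumptions based on our setup and the zeebe-hazelcast-exporter defaults:

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.ringbuffer.Ringbuffer;

public class CheckZeebeRingbuffer {
  public static void main(String[] args) {
    // Connect to the Hazelcast member the exporter writes to (placeholder address).
    ClientConfig config = new ClientConfig();
    config.getNetworkConfig().addAddress("hazelcast1:5701");

    HazelcastInstance client = HazelcastClient.newHazelcastClient(config);
    try {
      // "zeebe" is assumed to be the ringbuffer name used by the Hazelcast exporter.
      Ringbuffer<byte[]> ringbuffer = client.getRingbuffer("zeebe");
      // If the exporter is stuck, the tail sequence should stop advancing
      // while new workflow instances are created.
      System.out.println("head=" + ringbuffer.headSequence()
          + " tail=" + ringbuffer.tailSequence());
    } finally {
      client.shutdown();
    }
  }
}

If the tail sequence stops advancing while the broker keeps logging the IllegalArgumentException, that would point to the broker/log side rather than to Hazelcast itself.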
How do we move forward from here?
Deleting the whole raft partition data would probably fix the issue for now, but it would cause problems for the active workflow instances.
Broker version: 0.25.3
Any suggestions on how to troubleshoot this further?