CompleteJob and FailJob latency: responses sometimes return very slowly

We are using Camunda 8.5.5 and .NET 8 to drive our BPMN processes with job workers and to consume results by listening with an exporter. However, we are experiencing significant issues with the CompleteJob and FailJob calls. These problems are causing serious performance issues, and we are currently at a loss as to how to resolve them.

In each step, we perform the task and call CompleteJob over gRPC. Sometimes it completes in milliseconds, but at other times it fails with a timeout error after 15 seconds. I am willing to share all of our configuration with you.

Increasing the timeout is not a solution for me, as we are receiving complaints from our customers about this issue. In fact, no task should take as long as 15 seconds to complete.

If anyone has any suggestions or can assist us with this, I would greatly appreciate it.

We create 32 workers like this:
public IJobWorker CreateWorker(string jobType, CancellationToken cancellationToken = default)
{
    // Units assumed: PollingTimeout/PollInterval/Timeout take TimeSpan values (seconds here).
    return zeebeClient.NewWorker()
        .JobType(jobType)
        .Handler(WorkerHandler)
        .HandlerThreads(4)
        .MaxJobsActive(50)
        .Name(MACHINE_NAME)
        .PollingTimeout(TimeSpan.FromSeconds(10))
        .PollInterval(TimeSpan.FromSeconds(10))
        .Timeout(TimeSpan.FromSeconds(5))
        .Open();
}

if (variables is not null)
{
    await jobClient.NewCompleteJobCommand(job.Key)
        .Variables(variables)
        .Send(cancellationToken);
}
else
{
    await jobClient.NewCompleteJobCommand(job.Key)
        .Send(cancellationToken);
}
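
To make the slow calls visible on our side, this is roughly how we could time CompleteJob and give it a shorter client-side deadline instead of waiting the full 15 seconds. This is a simplified sketch, not our production code; the Send overload that takes an explicit timeout and the logger are assumptions:

var stopwatch = System.Diagnostics.Stopwatch.StartNew();
try
{
    // Assumption: zb-client exposes a Send overload with a client-side timeout.
    await jobClient.NewCompleteJobCommand(job.Key)
        .Variables(variables)
        .Send(TimeSpan.FromSeconds(2), cancellationToken);
}
catch (Grpc.Core.RpcException ex) when (ex.StatusCode == Grpc.Core.StatusCode.DeadlineExceeded)
{
    // The broker may still apply the completion even though the client stopped waiting.
    logger.LogWarning(ex, "CompleteJob for job {JobKey} exceeded the client deadline", job.Key);
    throw;
}
finally
{
    stopwatch.Stop();
    logger.LogInformation("CompleteJob for job {JobKey} took {ElapsedMs} ms", job.Key, stopwatch.ElapsedMilliseconds);
}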

Errors:

Status(StatusCode=“DeadlineExceeded”, Detail=“Time out between gateway and broker: Request command-api-3 to camunda-zeebe-2.camunda-zeebe.camunda.svc:26501 timed out in PT15S”)

Despite these errors, the workflow actually completes successfully on the broker. However, since the client call returns an error, our side also behaves as if it failed.
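
Because the timed-out command is often applied by the broker anyway, one thing we could do on our side is retry the completion and treat a NOT_FOUND rejection as "already completed". This is a rough sketch; the helper name and retry policy are illustrative, and it assumes the rejection surfaces as a gRPC NOT_FOUND status:

// Illustrative: if the first attempt timed out but the broker actually completed
// the job, the retry is rejected with NOT_FOUND, which we treat as success here.
async Task CompleteJobSafelyAsync(IJobClient jobClient, IJob job, string variables, CancellationToken ct)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            await jobClient.NewCompleteJobCommand(job.Key)
                .Variables(variables)
                .Send(ct);
            return;
        }
        catch (Grpc.Core.RpcException ex) when (ex.StatusCode == Grpc.Core.StatusCode.NotFound)
        {
            return; // job no longer exists; assume the earlier attempt completed it
        }
        catch (Grpc.Core.RpcException ex) when (ex.StatusCode == Grpc.Core.StatusCode.DeadlineExceeded && attempt < 3)
        {
            await Task.Delay(TimeSpan.FromMilliseconds(500 * attempt), ct); // back off, then retry
        }
    }
}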

Sometimes Zeebe returns this error instead:

Status(StatusCode=“Internal”, Detail=“Unexpected error occurred during the request processing: Connection RemoteClientConnection{channel=[id: 0xbd54555c, L:/10.42.5.179:58954 ! R:camunda-zeebe-1.camunda-zeebe.camunda.svc.cluster.local/10.42.1.127:26501]} was closed”)

Could you please explain this if possible?

What does your workflow look like? Are you calling an external worker from your service task?

Are you using Cloud or Self hosted?

If self-hosted, are you using a cloud provider or an on-premises K8s environment?

Would you please share your chart values if possible?

Also, please share the pod logs from the gateway and the Zeebe brokers. This will give some more details.


This is an example workflow; I can't share the real one. The worker listens to the service task, and based on the jobType, my own logic runs in a handler. For example, I fetch some data from my database, and if the response is 200, I send CompleteJob. However, sometimes CompleteJob takes a long time to return a response.

I am using a self-hosted environment on Kubernetes (K8s).

Here are today’s Zeebe logs. Most of the time, the tasks complete successfully.
2024-11-20 09:43:39.729 [Broker-0] [zb-actors-1] [HealthCheckService] WARN
io.camunda.zeebe.broker.system - Partition-1 failed, marking it as unhealthy: Partition-1{status=UNHEALTHY, issue=HealthIssue[message=null, throwable=null, cause=StreamProcessor-1{status=UNHEALTHY, issue=HealthIssue[message=actor appears blocked, throwable=null, cause=null]}]}
2024-11-20 09:43:39.738 [Broker-0] [zb-fs-workers-0] [Exporter-1] INFO
io.camunda.zeebe.broker.exporter.aura - 2251800010240100 process instance 2251800010240367 job key removed from processJobs successfully, job intent: COMPLETED
2024-11-20 09:43:40.954 [Broker-0] [zb-fs-workers-0] [Exporter-1] WARN
io.camunda.zeebe.broker.exporter.aura - Created jobKey: 2251800010240537, jobIntent: CREATED
2024-11-20 09:43:40.954 [Broker-0] [zb-fs-workers-0] [Exporter-1] INFO
io.camunda.zeebe.broker.exporter.aura - 2251800010240480 process instance 2251800010240537 job key added to processJobs successfully, job intent: CREATED, errorCode
2024-11-20 09:44:40.049 [Broker-0] [zb-actors-1] [HealthCheckService] INFO
io.camunda.zeebe.broker.system - Partition-1 recovered, marking it as healthy
2024-11-20 09:46:25.418 [Broker-2] [zb-fs-workers-0] [SnapshotStore-3] INFO
io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore - Committed new snapshot 350814193-131-394508928-394508935
2024-11-20 09:46:42.632 [Broker-0] [zb-fs-workers-2] [SnapshotStore-1] INFO
io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore - Committed new snapshot 351504856-137-397594653-397594661
2024-11-20 09:50:01.033 [Broker-2] [zb-fs-workers-1] [Exporter-3] WARN
io.camunda.zeebe.broker.exporter.aura - Created jobKey: 6755399636279751, jobIntent: CREATED
2024-11-20 09:50:01.033 [Broker-2] [zb-fs-workers-1] [Exporter-3] INFO
io.camunda.zeebe.broker.exporter.aura - 6755399636279739 process instance 6755399636279751 job key added to processJobs successfully, job intent: CREATED, errorCode
2024-11-20 09:50:01.373 [Broker-2] [zb-fs-workers-1] [Exporter-3] INFO
io.camunda.zeebe.broker.exporter.aura - 6755399636279739 process instance 6755399636279751 job key removed from processJobs successfully, job intent: COMPLETED
2024-11-20 09:50:01.392 [Broker-2] [zb-fs-workers-1] [Exporter-3] WARN
io.camunda.zeebe.broker.exporter.aura - Created jobKey: 6755399636279776, jobIntent: CREATED
2024-11-20 09:50:01.392 [Broker-2] [zb-fs-workers-1] [Exporter-3] INFO
io.camunda.zeebe.broker.exporter.aura - 6755399636279739 process instance 6755399636279776 job key added to processJobs successfully, job intent: CREATED, errorCode
2024-11-20 09:50:01.597 [Broker-2] [zb-fs-workers-2] [Exporter-3] INFO
io.camunda.zeebe.broker.exporter.aura - 6755399636279739 process instance 6755399636279776 job key removed from processJobs successfully, job intent: COMPLETED
2024-11-20 09:50:01.607 [Broker-2] [zb-fs-workers-2] [Exporter-3] INFO
io.camunda.zeebe.broker.exporter.aura - 6755399636279739 process instance removed from processVariables successfully
2024-11-20 09:50:14.619 [atomix-cluster-heartbeat-sender] WARN
io.atomix.cluster.protocol.swim.sync - 0 - Failed to synchronize membership with Member{id=1, address=camunda-zeebe-1.camunda-zeebe.camunda.svc:26502, properties={brokerInfo=EADJAAAABAABAAAAAwAAAAMAAAABAAAAAAABCgAAAGNvbW1hbmRBcGkvAAAAY2FtdW5kYS16ZWViZS0xLmNhbXVuZGEtemVlYmUuY2FtdW5kYS5zdmM6MjY1MDEFAAECAAAAAAwAAQIAAACJAAAAAAAAAAUAAAA4LjUuNQUAAQIAAAAB}, version=8.5.5, timestamp=1732053967489, state=ALIVE, incarnationNumber=1732053967516}
java.util.concurrent.TimeoutException: Request atomix-membership-sync to camunda-zeebe-1.camunda-zeebe.camunda.svc:26502 timed out in PT0.1S
at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$sendAndReceive$4(NettyMessagingService.java:261) ~[zeebe-atomix-cluster-8.5.5.jar:8.5.5]
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
at java.base/java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
at java.base/java.lang.Thread.run(Unknown Source) [?:?]
2024-11-20 09:51:14.148 [Broker-0] [zb-fs-workers-2] [Exporter-1] WARN
io.camunda.zeebe.broker.exporter.aura - Created jobKey: 2251800010245302, jobIntent: CREATED

Thanks for sharing the details. Could you please share a few more details?

Are you running it as a single-node K8s cluster or multi-node? Please do share the chart values if you used the Camunda 8 Helm chart.

There is an open issue already reported: Zeebe gateway and brokers spamming atomix-cluster-heartbeat-sender logs, failed to probe · Issue #14845 · camunda/camunda · GitHub

It is a 3-node K8s cluster.
We don't use the Helm chart.

Please try setting these values for your broker and gateway and let us know.

ZEEBE_BROKER_CLUSTER_MEMBERSHIP_PROBETIMEOUT=100ms
ZEEBE_GATEWAY_CLUSTER_MEMBERSHIP_PROBETIMEOUT=100ms

Hi @cpbpm, we have changed this value, but we are still encountering the same error.

Status(StatusCode=“DeadlineExceeded”, Detail=“Time out between gateway and broker: Request command-api-3 to camunda-zeebe-2.camunda-zeebe.camunda.svc:26501 timed out in PT15S”)

I am sharing our configuration with you along with the new Zeebe log.

I found that the default value for ZEEBE_GATEWAY_CLUSTER_REQUESTTIMEOUT is 15 seconds. However, as I mentioned, I don’t want this process to take anywhere near 15 seconds; that duration is far too high for us. I’m not sure what would happen if I increase or decrease this value. Since this is not an issue I can reproduce locally, experimenting with this setting directly is risky for us because our customers are actively using the environment. Could you please advise or guide me on this?

The default value you provided is already 100 ms, and we are seeing this error in the Zeebe logs as well.

It seems that our issue has multiple causes.


2024-11-21T07:34:44.907+03:00       io.atomix.cluster.protocol.swim.probe - camunda-zeebe-gateway-5f695d7ccf-2h5mz - Failed to probe 2
2024-11-21T07:34:44.907+03:00 java.util.concurrent.TimeoutException: Request atomix-membership-probe to camunda-zeebe-2.camunda-zeebe.camunda.svc:26502 timed out in PT0.1S
2024-11-21T07:34:44.907+03:00 	at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$sendAndReceive$4(NettyMessagingService.java:261) ~[zeebe-atomix-cluster-8.5.5.jar:8.5.5]
2024-11-21T07:34:44.907+03:00 	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
2024-11-21T07:34:44.907+03:00 	at java.base/java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
2024-11-21T07:34:44.907+03:00 	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?]
2024-11-21T07:34:44.907+03:00 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
2024-11-21T07:34:44.907+03:00 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
2024-11-21T07:34:44.907+03:00 	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
2024-11-21T07:34:44.907+03:00 	at java.base/java.lang.Thread.run(Unknown Source) [?:?]
2024-11-21T09:29:58.777+03:00 2024-11-21 06:29:58.777 [Gateway-camunda-zeebe-gateway-5f695d7ccf-2h5mz] [zb-actors-0] [ActivateJobsHandler] WARN 
2024-11-21T09:29:58.777+03:00       io.camunda.zeebe.gateway - Failed to activate jobs for type document-management from partition 1
2024-11-21T09:29:58.777+03:00 io.atomix.cluster.messaging.MessagingException$ConnectionClosed: Connection RemoteClientConnection{channel=[id: 0xa94f1af3, L:/10.42.2.167:35708 ! R:camunda-zeebe-0.camunda-zeebe.camunda.svc.cluster.local/10.42.5.65:26501]} was closed
2024-11-21T09:29:58.777+03:00 	at io.atomix.cluster.messaging.impl.AbstractClientConnection.close(AbstractClientConnection.java:76) ~[zeebe-atomix-cluster-8.5.5.jar:8.5.5]
2024-11-21T09:29:58.777+03:00 	at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$getOrCreateClientConnection$38(NettyMessagingService.java:702) ~[zeebe-atomix-cluster-8.5.5.jar:8.5.5]
2024-11-21T09:29:58.777+03:00 	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
2024-11-21T09:29:58.777+03:00 	at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
2024-11-21T09:29:58.777+03:00 	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
2024-11-21T09:29:58.777+03:00 	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
2024-11-21T09:29:58.778+03:00 	at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
2024-11-21T09:29:58.778+03:00 	at io.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:625) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
2024-11-21T09:29:58.778+03:00 	at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:105) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
2024-11-21T09:29:58.778+03:00 	at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
2024-11-21T09:29:58.778+03:00 	at io.netty.channel.AbstractChannel$CloseFuture.setClosed(AbstractChannel.java:1161) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
2024-11-21T09:29:58.778+03:00 	at io.netty.channel.AbstractChannel$AbstractUnsafe.doClose0(AbstractChannel.java:753) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
2024-11-21T09:29:58.778+03:00 	at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:729) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
2024-11-21T09:29:58.778+03:00 	at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:619) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
2024-11-21T09:29:58.778+03:00 	at io.netty.channel.DefaultChannelPipeline$HeadContext.close(DefaultChannelPipeline.java:1349) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
2024-11-21T09:29:58.778+03:00 	at io.netty.channel.AbstractChannelHandlerContext.invokeClose(AbstractChannelHandlerContext.java:755) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
2024-11-21T09:29:58.778+03:00 	at io.netty.channel.AbstractChannelHandlerContext.access$1200(AbstractChannelHandlerContext.java:61) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
2024-11-21T09:29:58.778+03:00 	at io.netty.channel.AbstractChannelHandlerContext$11.run(AbstractChannelHandlerContext.java:738) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
2024-11-21T09:29:58.778+03:00 	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
2024-11-21T09:29:58.778+03:00 	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
2024-11-21T09:29:58.778+03:00 	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
2024-11-21T09:29:58.778+03:00 	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:405) ~[netty-transport-classes-epoll-4.1.110.Final.jar:4.1.110.Final]
2024-11-21T09:29:58.778+03:00 	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
2024-11-21T09:29:58.778+03:00 	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
zeebe:
 affinity:
   podAntiAffinity:
     requiredDuringSchedulingIgnoredDuringExecution:
       - labelSelector:
           matchExpressions:
             - key: app.kubernetes.io/component
               operator: In
               values:
                 - zeebe-broker
         topologyKey: kubernetes.io/hostname
 clusterSize: '3'
 command: []
 configMap:
   defaultMode: 492
 configuration: ''
 containerSecurityContext:
   allowPrivilegeEscalation: false
   privileged: false
   readOnlyRootFilesystem: true
   runAsNonRoot: true
   runAsUser: 1001
   seccompProfile:
     type: RuntimeDefault
 cpuThreadCount: '3'
 debug: false
 dnsConfig: {}
 dnsPolicy: ''
 enabled: true
 env:
   - name: ZEEBE_BROKER_DATA_SNAPSHOTPERIOD
     value: 5m
   - name: ZEEBE_BROKER_GATEWAY_CLUSTER_REQUESTTIMEOUT
     value: PT2S
   - name: ZEEBE_GATEWAY_CLUSTER_REQUESTTIMEOUT
     value: PT2S
   - name: ZEEBE_BROKER_DATA_DISK_FREESPACE_REPLICATION
     value: 1GB
   - name: ZEEBE_BROKER_DATA_DISK_FREESPACE_PROCESSING
     value: 2GB
   - name: ZEEBE_AURA_DAPR_ADDRESS
     value: http://localhost:3500
   - name: ZEEBE_BROKER_EXPORTERS_AURA_CLASSNAME
     value: io.zeebe.aura.exporter.AuraExporter
   - name: ZEEBE_BROKER_EXPORTERS_AURA_JARPATH
     value: >-
       /usr/local/zeebe/exporters/zeebe-aura-exporter-1.0-jar-with-dependencies.jar
   - name: ZEEBE_AURA_DAPR_PUBSUB_NAME
     value: zeebe-pubsub
   - name: ZEEBE_AURA_DISABLED_VALUE_TYPES
     value: JOB_BATCH
   - name: ZEEBE_AURA_DISABLED_RECORD_TYPES
     value: COMMAND
   - name: ZEEBE_AURA_REQUEST_TOPIC
     value: REQUEST-EXPORTER
   - name: ZEEBE_AURA_TRANSACTION_TOPIC
     value: TRANSACTION-EXPORTER
   - name: ZEEBE_AURA_DAPR_TOPIC
     value: DAPR-EXPORTER
   - name: ZEEBE_AURA_CORRELATION_ID_KEY
     value: CorrelationId
   - name: ZEEBE_AURA_ACTIVITY_ID_KEY
     value: ActivityId
   - name: ZEEBE_AURA_RESPONSE_TOPIC
     value: RESPONSE-EXPORTER
   - name: ZEEBE_AURA_COMPENSATION_TOPIC
     value: COMPENSATION-EXPORTER
   - name: ZEEBE_AURA_MACHINENAME_VARIABLE_KEY
     value: CreatedOnMachineName
   - name: ZEEBE_AURA_DAPR_TOPIC_IS_BULK
     value: 'true'
 envFrom: []
 extraConfiguration: {}
 extraInitContainers: []
 extraVolumeMounts: []
 extraVolumes: []
 image:
   pullSecrets: []
   registry: registry.com/uat
   repository: camunda/zeebe
   tag: 8.5.5_20240903_113405
 initContainers: []
 ioThreadCount: '3'
 javaOpts: >-
   -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/zeebe/data
   -XX:ErrorFile=/usr/local/zeebe/data/zeebe_error%p.log
   -XX:+ExitOnOutOfMemoryError
 livenessProbe:
   enabled: false
   failureThreshold: 5
   initialDelaySeconds: 30
   periodSeconds: 30
   probePath: /actuator/health/readiness
   scheme: HTTP
   successThreshold: 1
   timeoutSeconds: 1
 log4j2: ''
 logLevel: info
 metrics:
   prometheus: /actuator/prometheus
 nodeSelector: {}
 partitionCount: '3'
 persistenceType: disk
 podAnnotations:
   dapr.io/app-id: zeebe-broker
   dapr.io/app-port: '9600'
   dapr.io/app-protocol: h2c
   dapr.io/disable-builtin-k8s-secret-store: 'true'
   dapr.io/enabled: 'true'
   dapr.io/sidecar-cpu-limit: 200m
   dapr.io/sidecar-memory-limit: 400Mi
 podDisruptionBudget:
   enabled: false
   maxUnavailable: 1
   minAvailable: null
 podLabels: {}
 podSecurityContext:
   fsGroup: 1000
   runAsNonRoot: true
   seccompProfile:
     type: RuntimeDefault
 priorityClassName: ''
 pvcAccessModes:
   - ReadWriteOnce
 pvcAnnotations: {}
 pvcSize: 32Gi
 pvcStorageClassName: ''
 readinessProbe:
   enabled: true
   failureThreshold: 5
   initialDelaySeconds: 30
   periodSeconds: 30
   probePath: /actuator/health/readiness
   scheme: HTTP
   successThreshold: 1
   timeoutSeconds: 1
 replicationFactor: '1'
 resources:
   limits:
     cpu: '2'
     memory: 4Gi
   requests:
     cpu: 800m
     memory: 1200Mi
 retention:
   enabled: false
   minimumAge: 30d
   policyName: zeebe-record-retention-policy
 service:
   annotations: {}
   commandName: command
   commandPort: 26501
   extraPorts: []
   httpName: http
   httpPort: 9600
   internalName: internal
   internalPort: 26502
   type: ClusterIP
 serviceAccount:
   annotations: {}
   automountServiceAccountToken: false
   enabled: true
   name: ''
 sidecars: []
 startupProbe:
   enabled: false
   failureThreshold: 5
   initialDelaySeconds: 30
   periodSeconds: 30
   probePath: /actuator/health/startup
   scheme: HTTP
   successThreshold: 1
   timeoutSeconds: 1
 strategy:
   type: RollingUpdate
 tolerations: []

From the error, the stream got closed because you are sending messages larger than 4 MB (the default limit). There is an open defect for this issue.

Recommended Action:

As a workaround, or to verify this, you could increase the max message size, limit the variables (fetchVariables) you fetch for each job, or reduce maxJobsActive so your payloads are smaller in general.
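
For example, the worker could fetch only the variables it actually needs and keep fewer jobs activated at once (a sketch assuming the zb-client FetchVariables builder step; the variable names are illustrative). On the cluster side, the gateway and broker limits can be raised via the standard environment overrides ZEEBE_GATEWAY_NETWORK_MAXMESSAGESIZE and ZEEBE_BROKER_NETWORK_MAXMESSAGESIZE.

// Sketch: fetch only what the handler needs and keep fewer jobs activated at once,
// so each ActivateJobs/CompleteJob payload stays well below the 4 MB default limit.
var worker = zeebeClient.NewWorker()
    .JobType("document-management")          // job type taken from the logs above
    .Handler(WorkerHandler)
    .FetchVariables("documentId", "status")  // illustrative variable names
    .MaxJobsActive(10)                       // smaller than the current 50
    .Name(Environment.MachineName)
    .PollInterval(TimeSpan.FromSeconds(1))
    .Timeout(TimeSpan.FromSeconds(30))
    .Open();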

Reference:


@cpbpm Even when there are very few variables in the BPMN flow, I often experience the same issue.