Long task execution for rarely created process instances

Hello! If a process instance is created rarely (once every few hours), tasks take 5-15 seconds to execute. If instances are created often, they take less than 1 second. The Elasticsearch exporter is used, and the Helm chart is installed with default settings. On the workers themselves the tasks execute quickly. Version 8.5.5. What could be the reason?
Maybe some client settings?

{
	"flowNodeType": "SERVICE_TASK",
	"flowNodeId": "Activity_03r4cyn",
	**"startDate": "2025-05-30 09:26:45",**
        **"endDate": "2025-05-30 09:26:45",**
	"eventId": "9007199263594352_9007199263594469",
	"jobDeadline": "2025-05-30 09:27:05",
	"jobType": "JobType_CompleteOrder",
	"jobRetries": 3,
	"jobWorker": "activateJobsWorkerService#processActivateJobsStreaming",
	"jobCustomHeaders": {},
	"incidentErrorType": null,
	"incidentErrorMessage": null,
	"flowNodeInstanceKey": "9007199263594469"
}

The same task, taking ~10 seconds:

{
	"flowNodeType": "SERVICE_TASK",
	"flowNodeId": "Activity_03r4cyn",
	**"startDate": "2025-05-30 09:28:49",**
	**"endDate": "2025-05-30 09:28:59",**
	"eventId": "4503599636225733_4503599636225853",
	"jobDeadline": "2025-05-30 09:29:19",
	"jobType": "JobType_CompleteOrder",
	"jobRetries": 3,
	"jobWorker": "activateJobsWorkerService#processActivateJobsStreaming",
	"jobCustomHeaders": {},
	"incidentErrorType": null,
	"incidentErrorMessage": null,
	"flowNodeInstanceKey": "4503599636225853"
}

Broker logs:

2025-05-30 06:28:49.545 [Broker-2] [zb-actors-2] [StreamProcessor-2] TRACE
      io.camunda.zeebe.broker.process - Process process instance event [context: {intent=ACTIVATE_ELEMENT, elementId=Activity_03r4cyn, bpmnElementType=SERVICE_TASK, elementInstanceKey=4503599636225853, flowScopeKey=4503599636225733, processInstanceKey=4503599636225733, parentProcessInstanceKey=-1, parentElementInstanceKey=-1, bpmnProcessId=MyProcess_step1, processVersion=1, processDefinitionKey=2251799820674400, tenantId=<default>}]
2025-05-30 06:28:49.557 [Broker-2] [zb-actors-0] [AsyncProcessingScheduleActor-2] TRACE
      io.camunda.zeebe.engine.processing.job.JobTimeoutChecker - Job timeout checker running...
2025-05-30 06:28:49.557 [Broker-2] [zb-actors-0] [AsyncProcessingScheduleActor-2] TRACE
      io.camunda.zeebe.engine.processing.job.JobTimeoutChecker - 0 timeout job commands appended to task result builder
2025-05-30 06:28:49.557 [Broker-2] [zb-actors-0] [AsyncProcessingScheduleActor-2] TRACE
      io.camunda.zeebe.stream - Write scheduled TaskResult to dispatcher!
      io.camunda.zeebe.transport - Handled request 318999173032583168 for topic command-api-2
      io.camunda.zeebe.db.impl.rocksdb.ZeebeRocksDBMetricExporter - Exporting RocksDBMetrics took + 0 ms
2025-05-30 06:28:50.560 [Broker-2] [zb-actors-0] [AsyncProcessingScheduleActor-2] TRACE
      io.camunda.zeebe.engine.processing.job.JobTimeoutChecker - Job timeout checker running...
2025-05-30 06:28:50.560 [Broker-2] [zb-actors-0] [AsyncProcessingScheduleActor-2] TRACE
      io.camunda.zeebe.engine.processing.job.JobTimeoutChecker - 0 timeout job commands appended to task result builder
2025-05-30 06:28:50.560 [Broker-2] [zb-actors-0] [AsyncProcessingScheduleActor-2] TRACE
      io.camunda.zeebe.stream - Write scheduled TaskResult to dispatcher!
2025-05-30 06:28:50.739 [Broker-2] [zb-fs-workers-2] [Exporter-1] TRACE
      io.camunda.zeebe.broker.exporter - Received new exporter state {elasticsearch={position=17729113, metadata={"recordCountersByValueType":{"JOB":1604,"DEPLOYMENT":20,"PROCESS_INSTANCE":13043,"INCIDENT":4,"MESSAGE":342,"MESSAGE_SUBSCRIPTION":35,"PROCESS_MESSAGE_SUBSCRIPTION":52,"TIMER":1096,"VARIABLE":4520,"PROCESS_INSTANCE_CREATION":48,"PROCESS":35,"RESOURCE_DELETION":20,"COMMAND_DISTRIBUTION":212}}}}
2025-05-30 06:28:51.142 [Broker-2] [zb-actors-2] [ServerTransport] TRACE
      io.camunda.zeebe.transport - Handled request 318999179630223360 for topic command-api-2
2025-05-30 06:28:59.434 [Broker-2] [zb-actors-0] [AsyncProcessingScheduleActor-2] TRACE
      io.camunda.zeebe.engine.processing.job.JobTimeoutChecker - Job timeout checker running...
2025-05-30 06:28:59.434 [Broker-2] [zb-actors-0] [AsyncProcessingScheduleActor-2] TRACE
      io.camunda.zeebe.engine.processing.job.JobTimeoutChecker - 0 timeout job commands appended to task result builder
2025-05-30 06:28:59.434 [Broker-2] [zb-actors-0] [AsyncProcessingScheduleActor-2] TRACE
      io.camunda.zeebe.stream - Write scheduled TaskResult to dispatcher
      io.camunda.zeebe.journal.file.SegmentsFlusher - Flushed 1 segment(s), from index 17702755 to index 17702755
2025-05-30 06:28:59.441 [Broker-2] [zb-actors-0] [ServerTransport] TRACE
      io.camunda.zeebe.transport - Send response to request 318999214430363648
2025-05-30 06:28:59.799 [Broker-2] [zb-actors-3] [ServerTransport] TRACE
      io.camunda.zeebe.transport - Handled request 318999215940313088 for topic command-api-2
2025-05-30 06:28:59.801 [Broker-2] [raft-server-2-2] [raft-server-2] TRACE
      io.camunda.zeebe.journal.file.Segment - Flushed segment 66 from index 17641299 to index 17702779
2025-05-30 06:28:59.801 [Broker-2] [raft-server-2-2] [raft-server-2] TRACE
      io.camunda.zeebe.journal.file.SegmentsFlusher - Flushed 1 segment(s), from index 17702779 to index 17702779
2025-05-30 06:28:59.803 [Broker-2] [zb-actors-1] [StreamProcessor-2] TRACE
      io.camunda.zeebe.broker.process - Process process instance event [context: {intent=COMPLETE_ELEMENT, elementId=Activity_03r4cyn, bpmnElementType=SERVICE_TASK, elementInstanceKey=4503599636225853, flowScopeKey=4503599636225733, processInstanceKey=4503599636225733, parentProcessInstanceKey=-1, parentElementInstanceKey=-1, bpmnProcessId=MyProcess_step1, processVersion=1, processDefinitionKey=2251799820674400, tenantId=<default>}]
2025-05-30 06:28:59.804 [Broker-2] [zb-actors-1] [StreamProcessor-2] TRACE
      io.camunda.zeebe.broker.process - Process process instance event [context: {intent=ACTIVATE_ELEMENT, elementId=Event_1m9wyfu, bpmnElementType=END_EVENT, elementInstanceKey=4503599636225915, flowScopeKey=4503599636225733, processInstanceKey=4503599636225733, parentProcessInstanceKey=-1, parentElementInstanceKey=-1, bpmnProcessId=MyProcess_step1, processVersion=1, processDefinitionKey=2251799820674400, tenantId=<default>}]
2025-05-30 06:28:59.805 [Broker-2] [zb-actors-1] [StreamProcessor-2] TRACE
      io.camunda.zeebe.broker.process - Process process instance event [context: {intent=COMPLETE_ELEMENT, elementId=MyProcess_step1, bpmnElementType=PROCESS, elementInstanceKey=4503599636225733, flowScopeKey=-1, processInstanceKey=4503599636225733, parentProcessInstanceKey=-1, parentElementInstanceKey=-1, bpmnProcessId=MyProcess_step1, processVersion=1, processDefinitionKey=2251799820674400, tenantId=<default>}]
2025-05-30 06:28:59.806 [Broker-2] [zb-actors-1] [ServerTransport] TRACE

Gateway logs (there are no other entries related to this worker):

2025-05-30 06:28:59.764 [Gateway-camunda-zeebe-gateway-66f74b8bb-g2khc] [zb-actors-0] [ActivateJobsHandler] TRACE
      io.camunda.zeebe.gateway.longPolling - Unblocking ActivateJobsRequest BrokerActivateJobsRequest{requestDto={"type":"JobType_CompleteOrder","worker":"activateJobsWorkerService#processActivateJobsStreaming","timeout":20000,"maxJobsToActivate":32,"jobKeys":[],"jobs":[],"variables":[],"truncated":false,"tenantIds":["<default>"]}}

Hi @Bos_kh - how do you have your Zeebe client configured for the job worker? Are you using job streaming? I suspect the issue is the polling delay on the job worker. Zeebe queues jobs and the job workers poll Zeebe for active jobs. Depending on how you have the polling configured, and how often jobs are available, you can encounter some variation in how quickly a job is activated for that job worker.

The simplest way to minimize that delay is to enable job streaming.
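
For reference, with the standalone Zeebe Java client you enable streaming on the worker builder. This is just a sketch; the gateway address, job type, and handler are placeholders taken from your logs, not your actual code:

import io.camunda.zeebe.client.ZeebeClient;

import java.time.Duration;

public class StreamingWorkerSketch {

  public static void main(String[] args) throws InterruptedException {
    // Placeholder gateway address - replace with your own.
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("camunda-zeebe-gateway.camunda-platform.svc:26500")
        .usePlaintext()
        .build()) {

      // streamEnabled(true) keeps a long-lived stream open to the gateway so the
      // broker can push new jobs to this worker immediately instead of waiting
      // for the next poll cycle; polling still runs as a fallback.
      client.newWorker()
          .jobType("JobType_CompleteOrder")
          .handler((jobClient, job) ->
              jobClient.newCompleteCommand(job.getKey()).send().join())
          .streamEnabled(true)
          .pollInterval(Duration.ofMillis(100))
          .timeout(Duration.ofSeconds(20))
          .open();

      // Keep the JVM alive while the worker runs (illustration only).
      Thread.currentThread().join();
    }
  }
}

If you configure the worker through Spring properties instead, the equivalent is a stream-enabled: true entry under the worker defaults.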

Hello @nathan.loding! Thanks for your reply. I will try switching to streaming mode. What other optimization options are there for periodically executed workflows?
Current client settings:

camunda:
  client:
    mode: self-managed
    zeebe:
      enabled: true
      grpc-address: https://url:26500
      keep-alive: 30
      defaults:
        tenant-ids: <default>
        auto-complete: false

@Bos_kh - I would experiment with different polling intervals, which is set either in the builder when creating the client, or in the application properties:

camunda:
  client:
    job:
      poll-interval: 1m

This controls how frequently the client is polling for active jobs.
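
If you are creating the client programmatically, the same knob looks roughly like this (a sketch with a placeholder address; defaultJobPollInterval sets the default for all workers opened from the client, and pollInterval on the worker builder overrides it per worker):

import io.camunda.zeebe.client.ZeebeClient;

import java.time.Duration;

public class PollIntervalSketch {

  public static void main(String[] args) throws InterruptedException {
    // Placeholder gateway address - replace with your own.
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("camunda-zeebe-gateway.camunda-platform.svc:26500")
        .usePlaintext()
        // Default poll interval for every worker opened from this client.
        .defaultJobPollInterval(Duration.ofMillis(100))
        .build()) {

      client.newWorker()
          .jobType("JobType_CompleteOrder")
          .handler((jobClient, job) ->
              jobClient.newCompleteCommand(job.getKey()).send().join())
          // Per-worker override: poll for new jobs every 100 ms.
          .pollInterval(Duration.ofMillis(100))
          .open();

      // Keep the client (and the worker) alive for the demo.
      Thread.currentThread().join();
    }
  }
}

A shorter poll interval reduces activation latency for polling workers at the cost of more frequent ActivateJobs requests against the gateway.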

Hello!
I added the stream-enabled: true flag to the service, but the problem still persists.

Usually it takes less than a second, but here it took 8 seconds:
{
	"flowNodeType": "SERVICE_TASK",
	"flowNodeId": "Activity_03r4cyn",
	"startDate": "2025-06-05 12:50:56",
	"endDate": "2025-06-05 12:51:04",
	"eventId": "2251799823815601_2251799823815701",
	"jobType": "JobType_CompleteOrder",
	"jobRetries": 3,
	"jobWorker": "activateJobsWorkerService#processActivateJobsStreaming",
	"jobCustomHeaders": {},
	"jobDeadline": null,
	"incidentErrorType": null,
	"incidentErrorMessage": null,
	"flowNodeInstanceKey": "2251799823815701"
}

Worker config:

camunda:
  client:
    mode: self-managed
    zeebe:
      enabled: true
      grpc-address: http://camunda-zeebe-gateway.camunda-platform.svc:26500
      message-time-to-live: 15M
      defaults:
        tenant-ids: <default>
        auto-complete: false
        stream-enabled: true
        stream-timeout: 604800S

Similar delays occur with the vanilla Camunda connectors (GraphQL).
I also deployed the service and BPMN process from the Camunda repository (process_payments:0.0.1-SNAPSHOT), and the delay occurs there as well with the default client settings.

spring:
  application:
    name: process_payments
camunda:
  client:
    mode: self-managed
    zeebe:
      enabled: true
      grpc-address: http://camunda-zeebe-gateway.camunda-platform.svc:26500
      rest-address: http://camunda-zeebe-gateway.camunda-platform.svc:8080

Camunda broker logs:

2025-06-05 12:50:56.092 [Broker-1] [zb-actors-2] [StreamProcessor-1] TRACE
      io.camunda.zeebe.broker.process - Process process instance event [context: {intent=ACTIVATE_ELEMENT, elementId=Activity_03r4cyn, bpmnElementType=SERVICE_TASK, elementInstanceKey=2251799823815701, flowScopeKey=2251799823815601, processInstanceKey=2251799823815601, parentProcessInstanceKey=-1, parentElementInstanceKey=-1, bpmnProcessId=ChangeSim_step1, processVersion=1, processDefinitionKey=2251799820674400, tenantId=<default>}]
2025-06-05 12:50:56.094 [Broker-1] [zb-actors-2] [RemoteStreamerImpl] TRACE
      io.camunda.zeebe.transport.stream.impl.RemoteStreamPusher - Pushed {"key":2251799823815704,"record":{"deadline":1749128156085,"timeout":-1,"worker":"activateJobsWorkerService#processActivateJobsStreaming","retries":3,"retryBackoff":0,"recurringTime":-1,"type":"JobType_CompleteOrder","customHeaders":[packed value (length=1)],"variables":"xxxxx","errorMessage":"","errorCode":"","bpmnProcessId":"Change_step1","processDefinitionVersion":1,"processDefinitionKey":2251799820674400,"processInstanceKey":2251799823815601,"jobKind":"BPMN_ELEMENT","elementId":"Activity_03r4cyn","elementInstanceKey":2251799823815701,"tenantId":"<default>"}} to stream StreamId[streamId=10ff2552-0cb5-485b-a822-8506a05c87f2, receiver=camunda-zeebe-gateway-7bbb5ffb6f-rkw7l]
2025-06-05 12:51:04.351 [Broker-1] [zb-actors-1] [StreamProcessor-1] TRACE
      io.camunda.zeebe.broker.process - Process process instance event [context: {intent=COMPLETE_ELEMENT, elementId=Activity_03r4cyn, bpmnElementType=SERVICE_TASK, elementInstanceKey=2251799823815701, flowScopeKey=2251799823815601, processInstanceKey=2251799823815601, parentProcessInstanceKey=-1, parentElementInstanceKey=-1, bpmnProcessId=Change_step1, processVersion=1, processDefinitionKey=2251799820674400, tenantId=<default>}]

Worker log (12:51:04):

{"@timestamp":"2025-06-05T12:51:04.313949894Z","@version":"1","message":"Received jobId: 2251799823815704, bpmnProcessId: Change_step1, Variables: {,"logger_name":"cocm.services.ActivateJobsWorkerService","thread_name":"pool-5-thread-1","level":"INFO","level_value":20000}
{"@timestamp":"2025-06-05T12:51:04.3550907Z","@version":"1","message":"Job with process instance id Change_step1 is completed","logger_name":"cocm.services.ActivateJobsWorkerService","thread_name":"grpc-default-executor-63","level":"INFO","level_value":20000}

Also, sometimes the log shows entries like:

2025-06-05 12:50:50.910 [Broker-2] [zb-actors-1] [RemoteStreamerImpl] WARN 
      io.camunda.zeebe.transport.stream.impl.RemoteStreamPusher - Failed to push (size = 1622) to stream StreamId[streamId=10ff2552-0cb5-485b-a822-8506a05c87f2, receiver=camunda-zeebe-gateway-7bbb5ffb6f-rkw7l]
java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException: Request stream-push to 10.222.11.111:26502 timed out in PT5S
	at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
	at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$sendAndReceive$4(NettyMessagingService.java:259) ~[zeebe-atomix-cluster-8.5.6.jar:8.5.6]
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.112.Final.jar:4.1.112.Final]
	at java.base/java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: java.util.concurrent.TimeoutException: Request stream-push to 10.222.11.111:26502 timed out in PT5S
	at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$sendAndReceive$4(NettyMessagingService.java:261) ~[zeebe-atomix-cluster-8.5.6.jar:8.5.6]
	... 7 more
2025-06-05 12:50:50.910 [Broker-2] [zb-actors-1] [RemoteStreamerImpl] TRACE
      io.camunda.zeebe.transport.stream.impl.RemoteStreamImpl - Failed to push payload (size = 1622), no more streams to retry
2025-06-05 12:50:50.910 [Broker-2] [zb-actors-1] [RemoteJobStreamErrorHandler] TRACE
      io.camunda.zeebe.broker.jobstream.YieldingJobStreamErrorHandler - Failed to push job 4503599637496008. Yielding...
java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException: Request stream-push to 10.222.12.189:26502 timed out in PT5S
	at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
	at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$sendAndReceive$4(NettyMessagingService.java:259) ~[zeebe-atomix-cluster-8.5.6.jar:8.5.6]
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.112.Final.jar:4.1.112.Final]
	at java.base/java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: java.util.concurrent.TimeoutException: Request stream-push to 10.222.12.189:26502 timed out in PT5S
	at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$sendAndReceive$4(NettyMessagingService.java:261) ~[zeebe-atomix-cluster-8.5.6.jar:8.5.6]
	... 7 more

Broker config:

zeebe:
  broker:
    exporters:
      elasticsearch:
        className: "io.camunda.zeebe.exporter.ElasticsearchExporter"
        args:
          url: "http://camunda-elasticsearch:9200"
          requestTimeoutMs: 60000
          bulk:
            delay: 1
            size: "1000"
          index:
            prefix: "zeebe-record"
            numberOfShards: "4"
            message: false
            messageStartEventSubscription: false
            messageSubscription: false
            processMessageSubscription: false
          retention:
            enabled: true
            minimumAge: "4d"
            policyName: "zeebe-record-retention-policy"
    gateway:
      enable: true
      network:
        port: 26500
      security:
        enabled: false
        authentication:
          mode: none
    network:
      host: 0.0.0.0
      commandApi:
        port: 26501
      internalApi:
        port: 26502
      monitoringApi:
        port: "9600"
    cluster:
      clusterSize: "4"
      replicationFactor: "3"
      partitionsCount: "4"
      clusterName: camunda-zeebe
    threads:
      cpuThreadCount: "4"
      ioThreadCount: "4"

Gateway config:

server:
  address: "0.0.0.0"
  port: "8080"
zeebe:
  gateway:
    monitoring:
      host: 0.0.0.0
      port: "9600"
    cluster:
      clusterName: camunda-zeebe
      port: "26502"
    longPolling:
      enabled: true
      timeout: 100000000
      probeTimeout: 10000
      minEmptyResponses: 3

No degradation is visible in CPU/RAM/IO metrics; there are enough resources.



Hi @Bos_kh - you say “usually less than a second, but here’s 8s” - can you clarify what you mean? What percentage of jobs are in the “usually less than a second” category, and what percentage in the 8s category?

Hello, @nathan.loding!
Here I meant a specific job; 8 seconds is just an example. In reality, when the “degradation” occurs, the fluctuations are 2-12 seconds, while the norm is less than 1 second.
Under a regular load, the entire BPMN process completes within about 2 seconds (I haven’t calculated the percentiles, but that should definitely be done).
As soon as we remove the load and make only occasional calls (no more often than once every 3 hours, for example), we are guaranteed to get an execution delay on the first instances of the processes. The delay occurs on different tasks.
Workloads in production will also be episodic.