aws camunda zeebe ClientStatusException: deadline exceeded after 19.999991631s. [remote_addr=abc.elb.amazonaws.com/host:port]

Hi All,

I am getting the exception below in my application while processing messages with Camunda Zeebe (running on AWS EKS).
Sequential message insertion (single thread) works, but parallel insertion (multithreaded) does not.
It seems like the Zeebe worker is not able to handle the load with the existing Zeebe cluster configuration.
I tried multiple combinations of the Zeebe configuration, e.g. 6 or 12 brokers and 1 or 24 partitions, but I get the same issue.

Exception:

io.camunda.zeebe.client.api.command.ClientStatusException: deadline exceeded after 19.999995604s. [remote_addr=a6015d115a-128.us-east-1.elb.amazonaws.com/5.2.2.1:8080]
	at io.camunda.zeebe.client.impl.ZeebeClientFutureImpl.transformExecutionException(ZeebeClientFutureImpl.java:93)
	at io.camunda.zeebe.client.impl.ZeebeClientFutureImpl.join(ZeebeClientFutureImpl.java:50)
	at com.db.matchservice.service.WorkFlowAction.executeWorkflow(WorkFlowAction.java:63)
Caused by: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 19.999995604s. [remote_addr=a6015d115a-128.us-east-1.elb.amazonaws.com/5.2.2.1:8080]
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
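
For context, executeWorkflow is essentially a blocking create-instance call. A simplified sketch (process id and variable shape are illustrative, not our real ones):

import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.ProcessInstanceEvent;
import java.time.Duration;

public class WorkFlowAction {

  private final ZeebeClient client;

  public WorkFlowAction(ZeebeClient client) {
    this.client = client;
  }

  // Simplified: the blocking join() below is where the DEADLINE_EXCEEDED
  // surfaces once the client request timeout elapses.
  public ProcessInstanceEvent executeWorkflow(String messageJson) {
    return client
        .newCreateInstanceCommand()
        .bpmnProcessId("match-process")          // hypothetical process id
        .latestVersion()
        .variables(messageJson)                  // e.g. {"message": "..."}
        .requestTimeout(Duration.ofSeconds(20))  // matches the ~20s deadline in the log
        .send()
        .join();                                 // blocks; throws ClientStatusException on timeout
  }
}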

AWS Instance details:

Instance type: c5.12xlarge, IOPS: 16000, SSD: 250 GB, CPU: 8, RAM: 32 GB

Zeebe cluster configuration:

spec:
      volumes:
        - name: config
          configMap:
            name: zeebe-cluster-zeebe-cluster-helm
            defaultMode: 484
        - name: exporters
          emptyDir: {}
      containers:
        - name: zeebe-cluster-helm
          image: 'camunda/zeebe:1.1.2'
          ports:
            - name: http
              containerPort: 9600
              protocol: TCP
            - name: command
              containerPort: 26501
              protocol: TCP
            - name: internal
              containerPort: 26502
              protocol: TCP
          env:
            - name: ZEEBE_BROKER_CLUSTER_CLUSTERNAME
              value: zeebe-cluster-zeebe
            - name: ZEEBE_LOG_LEVEL
              value: info
            - name: ZEEBE_BROKER_CLUSTER_PARTITIONSCOUNT
              value: '24'
            - name: ZEEBE_BROKER_CLUSTER_CLUSTERSIZE
              value: '6'
            - name: ZEEBE_BROKER_CLUSTER_REPLICATIONFACTOR
              value: '1'
            - name: ZEEBE_BROKER_THREADS_CPUTHREADCOUNT
              value: '24'
            - name: ZEEBE_BROKER_THREADS_IOTHREADCOUNT
              value: '24'
            - name: ZEEBE_BROKER_GATEWAY_ENABLE
              value: 'false'
            - name: ZEEBE_BROKER_EXPORTERS_ELASTICSEARCH_CLASSNAME
              value: io.camunda.zeebe.exporter.ElasticsearchExporter
            - name: ZEEBE_BROKER_EXPORTERS_ELASTICSEARCH_ARGS_URL
              value: 'http://elasticsearch-master:9200'
            - name: ZEEBE_BROKER_NETWORK_COMMANDAPI_PORT
              value: '26501'
            - name: ZEEBE_BROKER_NETWORK_INTERNALAPI_PORT
              value: '26502'
            - name: ZEEBE_BROKER_NETWORK_MONITORINGAPI_PORT
              value: '9600'
            - name: K8S_POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: JAVA_TOOL_OPTIONS
              value: >-
                -XX:MaxRAMPercentage=25.0 -XX:+HeapDumpOnOutOfMemoryError
                -XX:HeapDumpPath=/usr/local/zeebe/data
                -XX:ErrorFile=/usr/local/zeebe/data/zeebe_error%p.log
                -XX:+ExitOnOutOfMemoryError
          resources:
            limits:
              cpu: '1'
              memory: 4Gi
            requests:
              cpu: 500m
              memory: 2Gi
          volumeMounts:
            - name: config
              mountPath: /usr/local/zeebe/config/application.yaml
              subPath: application.yaml
            - name: config
              mountPath: /usr/local/bin/startup.sh
              subPath: startup.sh
            - name: data
              mountPath: /usr/local/zeebe/data
            - name: exporters
              mountPath: /exporters
          readinessProbe:
            httpGet:
              path: /ready
              port: 9600
              scheme: HTTP
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      securityContext: {}
      schedulerName: default-scheduler
  volumeClaimTemplates:
    - kind: PersistentVolumeClaim
      apiVersion: v1
      metadata:
        name: data
        creationTimestamp: null
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
        volumeMode: Filesystem
      status:
        phase: Pending
  serviceName: zeebe-cluster-zeebe
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
  revisionHistoryLimit: 10
status:
  observedGeneration: 17
  replicas: 6
  readyReplicas: 6
  currentReplicas: 6
  updatedReplicas: 6
  currentRevision: zeebe-cluster-zeebe-55c88fdf8f
  updateRevision: zeebe-cluster-zeebe-55c88fdf8f
  collisionCount: 0

Hi @tokendra and welcome to the forums :wave:

Have you been able to run the cluster successfully with low load, or not at all? Do all requests time out, or just a couple?

If you’ve not been able to handle any requests successfully, then I suggest you start by reducing your cluster to the bare essentials: 1 broker, 1 partition. When that works, increase the number of partitions, and when that works, increase the number of brokers. Make sure to wipe the persistent volumes between these changes, because Zeebe cannot deal with changes to clusterSize or partitionsCount on existing data.
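
With the env vars from your posted spec, the bare-essentials setup would look like this (and scale the StatefulSet down to 1 replica as well):

# Bare-essentials cluster for step-wise scaling tests
- name: ZEEBE_BROKER_CLUSTER_CLUSTERSIZE
  value: '1'
- name: ZEEBE_BROKER_CLUSTER_PARTITIONSCOUNT
  value: '1'
- name: ZEEBE_BROKER_CLUSTER_REPLICATIONFACTOR
  value: '1'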

If you’ve been able to handle most requests but not all, then I’m unsure why you’re seeing timeouts. Normally this should not happen, because back pressure should have kicked in first. You’ll notice back pressure because requests are rejected with RESOURCE_EXHAUSTED.
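
For illustration, a client could tolerate back pressure with a simple retry loop like this (a sketch; it assumes ClientStatusException exposes the gRPC status code, as in current client versions):

import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.command.ClientStatusException;
import io.grpc.Status.Code;

public final class BackpressureAwareStarter {

  // Retries a create-instance command with exponential backoff whenever the
  // gateway rejects it with RESOURCE_EXHAUSTED (i.e. back pressure).
  public static void createWithBackoff(ZeebeClient client, String processId)
      throws InterruptedException {
    long backoffMs = 100;
    while (true) {
      try {
        client.newCreateInstanceCommand()
            .bpmnProcessId(processId)
            .latestVersion()
            .send()
            .join();
        return;
      } catch (ClientStatusException e) {
        if (e.getStatusCode() != Code.RESOURCE_EXHAUSTED) {
          throw e; // a real error (e.g. DEADLINE_EXCEEDED), not back pressure
        }
        Thread.sleep(backoffMs);
        backoffMs = Math.min(backoffMs * 2, 5_000); // cap the backoff at 5s
      }
    }
  }
}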

Hope this helps :slight_smile:

Yes, the cluster runs successfully for a single user and a single message, but it’s not working for multiple users (multithreaded).

It’s unclear to me what you mean by single user and single message.

In any case, I recommend increasing the CPU limit so your brokers can actually use the available CPU of the machines they are running on. To give some insight:

24 partitions are spread over 6 brokers without replication, so each of your 6 brokers gets 4 partitions. Each partition requires at least 1 CPU, so make sure each broker has at least 4 CPUs available. More might be better, but that depends on your use case.
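
Concretely, that would mean bumping the resources in the spec above to something like this (a sketch; the exact values depend on your nodes and workload):

resources:
  limits:
    cpu: '4'      # at least 1 CPU per partition hosted on this broker
    memory: 4Gi
  requests:
    cpu: '4'      # request what you limit, so the scheduler reserves it
    memory: 2Gi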

Hi @korthout - Single user means a single thread inserting messages in sequential order.
So our application connects to the Zeebe brokers through the zeebe-gateway sequentially.

Multi user means multithreaded message insertion. Let’s say there are 100 threads connecting to the zeebe-gateway, so the gateway needs to handle multiple requests at a time. Here it’s not working; we are getting a timeout exception (a sketch of the test follows below the exception).

io.camunda.zeebe.client.api.command.ClientStatusException: Time out between gateway and broker: Request ProtocolRequest{id=63617, subject=command-api-3, sender=175.36.79.226:26502, payload=byte{length=1269, hash=1747922814}} to zeebe-cluster-zeebe-2.zeebe-cluster-zeebe.default.svc.cluster.local:26501 timed out in PT10S
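
For reference, the multithreaded test does roughly the following (simplified; pool size, loop count, gateway address, and process id are illustrative):

import io.camunda.zeebe.client.ZeebeClient;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LoadTest {

  public static void main(String[] args) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(100); // 100 "users"
    ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("zeebe-gateway:26500") // assumed gateway service address
        .usePlaintext()
        .build();

    for (int i = 0; i < 1_000; i++) {
      pool.submit(() ->
          client.newCreateInstanceCommand()
              .bpmnProcessId("match-process") // hypothetical process id
              .latestVersion()
              .send()
              .join()); // blocks this worker thread until the gateway responds
    }

    pool.shutdown();
    pool.awaitTermination(5, TimeUnit.MINUTES);
    client.close();
  }
}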

Hi @tokendra,

Thanks for your explanation. Your test setup makes more sense to me now.

Did you already test with my suggested CPU limit changes? I still believe your brokers simply don’t have enough CPU available to run that many partitions under high load.