Camunda 8 Self-managed breaks when simulating high load

While testing Camunda 8 self-managed, I encountered a strange scenario:

  1. I created a Java application that can start some simple processes and run external task workers. It can also start a batch of 100 processes at once to simulate larger loads (a simplified sketch of the client code is shown after the docker-compose file below).
    The process can be found below:

    Each step logs a simple string, stores the action in a local database, and then continues execution.
  2. When starting a single process, everything works fine and it completes successfully.
  3. When starting a batch of processes (tested and reproduced with 10, 100, and 1,000), some of them run for a while, then everything breaks and nothing continues execution. In this case, the following error is seen:
RESOURCE_EXHAUSTED: Expected to execute the command on one of the partitions, but all failed; there are no more partitions available to retry. Please try again. If the error persists contact your zeebe operator

Also, some processes seem to be lost, and the logs contain the following warning:

Ignoring the the error of type 'NOT_FOUND' during {command=class io.camunda.zeebe.client.impl.command.CompleteJobCommandImpl, job={"key":2251799813687745,"type":"log1","customHeaders":{},"processInstanceKey":2251799813687728,"bpmnProcessId":"test_process","processDefinitionVersion":1,"processDefinitionKey":2251799813685250,"elementId":"Activity_0bc6ur9","elementInstanceKey":2251799813687739,"worker":"log1Worker#log1","retries":3,"deadline":1657021289204,"variables":"{\"seed\":100}","variablesAsMap":{"seed":100}}, currentRetryDelay=50}. Job might have been canceled or already completed.
  4. The docker-compose.yaml file I use is the default one, with some changes (added Kibana and set the Camunda images to always use the latest version):
# While the Docker images themselves are supported for production usage,
# this docker-compose.yaml is designed to be used by developers to run
# an environment locally. It is not designed to be used in production.
# We recommend to use Kubernetes in production with our Helm Charts:
# https://docs.camunda.io/docs/self-managed/platform-deployment/kubernetes-helm/
# For local development, we recommend using KIND instead of `docker-compose`:
# https://docs.camunda.io/docs/self-managed/platform-deployment/kubernetes-helm/#installing-the-camunda-helm-chart-locally-using-kind

# This is a lightweight configuration with Zeebe, Operate, Tasklist, and Elasticsearch
# See docker-compose.yml for a configuration that also includes Optimize, Identity, and Keycloak.

services:

  zeebe: # https://docs.camunda.io/docs/self-managed/platform-deployment/docker/#zeebe
    image: camunda/zeebe:latest
    container_name: zeebe
    ports:
      - "26500:26500"
      - "9600:9600"
    environment: # https://docs.camunda.io/docs/self-managed/zeebe-deployment/configuration/environment-variables/
      - ZEEBE_BROKER_EXPORTERS_ELASTICSEARCH_CLASSNAME=io.camunda.zeebe.exporter.ElasticsearchExporter
      - ZEEBE_BROKER_EXPORTERS_ELASTICSEARCH_ARGS_URL=http://elasticsearch:9200
      - ZEEBE_BROKER_EXPORTERS_ELASTICSEARCH_ARGS_BULK_SIZE=1
      # allow running with low disk space
      - ZEEBE_BROKER_DATA_DISKUSAGECOMMANDWATERMARK=0.998
      - ZEEBE_BROKER_DATA_DISKUSAGEREPLICATIONWATERMARK=0.999
      - "JAVA_TOOL_OPTIONS=-Xms512m -Xmx512m"
    restart: always
    volumes:
      - zeebe:/usr/local/zeebe/data
    networks:
      - camunda-platform
    depends_on:
      - elasticsearch

  operate: # https://docs.camunda.io/docs/self-managed/platform-deployment/docker/#operate
    image: camunda/operate:latest
    container_name: operate
    ports:
      - "8082:8080"
    environment: # https://docs.camunda.io/docs/self-managed/operate-deployment/configuration/
      - CAMUNDA_OPERATE_ZEEBE_GATEWAYADDRESS=zeebe:26500
      - CAMUNDA_OPERATE_ELASTICSEARCH_URL=http://elasticsearch:9200
      - CAMUNDA_OPERATE_ZEEBEELASTICSEARCH_URL=http://elasticsearch:9200
    networks:
      - camunda-platform
    depends_on:
      - zeebe
      - elasticsearch

  tasklist: # https://docs.camunda.io/docs/self-managed/platform-deployment/docker/#tasklist
    image: camunda/tasklist:latest
    container_name: tasklist
    ports:
      - "8083:8080"
    environment: # https://docs.camunda.io/docs/self-managed/tasklist-deployment/configuration/
      - CAMUNDA_TASKLIST_ZEEBE_GATEWAYADDRESS=zeebe:26500
      - CAMUNDA_TASKLIST_ELASTICSEARCH_URL=http://elasticsearch:9200
      - CAMUNDA_TASKLIST_ZEEBEELASTICSEARCH_URL=http://elasticsearch:9200
    networks:
      - camunda-platform
    depends_on:
      - zeebe
      - elasticsearch

  elasticsearch: # https://hub.docker.com/_/elasticsearch
    image: docker.elastic.co/elasticsearch/elasticsearch:${ELASTIC_VERSION:-7.17.0}
    container_name: elasticsearch
    ports:
      - "9200:9200"
      - "9300:9300"
    environment:
      - bootstrap.memory_lock=true
      - discovery.type=single-node
      # allow running with low disk space
      - cluster.routing.allocation.disk.threshold_enabled=false
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    restart: always
    healthcheck:
      test: [ "CMD-SHELL", "curl -f http://localhost:9200/_cat/health | grep -q green" ]
      interval: 30s
      timeout: 5s
      retries: 3
    volumes:
      - elastic:/usr/share/elasticsearch/data
    networks:
      - camunda-platform
  kibana:
    container_name: kibana
    image: docker.elastic.co/kibana/kibana:7.11.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    networks:
      - camunda-platform
    depends_on:
      - elasticsearch
    ports:
      - 5601:5601

volumes:
  zeebe:
  elastic:

networks:
  camunda-platform:
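
For reference, the client application that starts the processes and completes the jobs looks roughly like this (a simplified sketch rather than the exact code; the gateway address, job type, process id, and the "seed" variable match the setup and log output above):

import java.util.Map;

import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.ActivatedJob;
import io.camunda.zeebe.client.api.worker.JobClient;

public class LoadSimulation {

    public static void main(String[] args) throws InterruptedException {
        try (ZeebeClient client = ZeebeClient.newClientBuilder()
                .gatewayAddress("localhost:26500")
                .usePlaintext()
                .build()) {

            // external task worker for the "log1" service task
            client.newWorker()
                    .jobType("log1")
                    .handler(LoadSimulation::handleJob)
                    .open();

            // start a batch of process instances to simulate load
            for (int i = 0; i < 100; i++) {
                client.newCreateInstanceCommand()
                        .bpmnProcessId("test_process")
                        .latestVersion()
                        .variables(Map.of("seed", i))
                        .send()
                        .join();
            }

            Thread.sleep(60_000); // keep the worker running for a while
        }
    }

    private static void handleJob(JobClient jobClient, ActivatedJob job) {
        System.out.println("Executing " + job.getElementId()); // log a simple string
        // ... store the action in the local database ...
        jobClient.newCompleteCommand(job.getKey()).send().join();
    }
}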


I also performed similar tests on Camunda Platform 7, and everything seemed fine with 1,000+ processes started at once.

Hi @Cody_Newman,

you are experiencing backpressure here. Background information can be found here: Internal processing | Camunda Platform 8.

There are some options for configuring this behavior of the engine: Backpressure | Camunda Platform 8

Alternatively, you can handle the RESOURCE_EXHAUSTED error in your worker and wait for a while before sending the next command.
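
For example, a job worker could catch the RESOURCE_EXHAUSTED status and back off before retrying. A minimal sketch with the Zeebe Java client (the retry count and delays are only illustrative):

import io.camunda.zeebe.client.api.command.ClientStatusException;
import io.camunda.zeebe.client.api.response.ActivatedJob;
import io.camunda.zeebe.client.api.worker.JobClient;
import io.grpc.Status.Code;

public class BackpressureAwareHandler {

    // Completes the job; when the gateway reports RESOURCE_EXHAUSTED (backpressure),
    // waits for a while and retries with a growing delay instead of failing immediately.
    static void completeWithRetry(JobClient jobClient, ActivatedJob job) throws InterruptedException {
        long delayMs = 100; // illustrative initial delay
        for (int attempt = 0; attempt < 5; attempt++) {
            try {
                jobClient.newCompleteCommand(job.getKey()).send().join();
                return; // completed successfully
            } catch (ClientStatusException e) {
                if (e.getStatusCode() != Code.RESOURCE_EXHAUSTED) {
                    throw e; // not backpressure, propagate
                }
                Thread.sleep(delayMs); // wait before sending the next command
                delayMs *= 2;          // simple exponential backoff
            }
        }
    }
}

The same pattern applies to the create-instance calls that produce the batch.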

Hope this helps, Ingo
