Zeebe 0.23.7 and 0.26.1 problems - Job execution intermittently pauses and the associated worker is not invoked

Hi everybody,

we are using Zeebe in production, but we are suffering from problems that we cannot reproduce. Randomly, one workflow execution gets blocked and the worker for that jobType is not activated; after approx. 300 seconds, it is activated.

Our workers use the Java client and are deployed in a k8s cluster, but Zeebe is installed outside the cluster, on 3 MVs: one for the standalone gateway and Elasticsearch, one with 3 brokers, and the last one with the other 3 brokers. Our Zeebe cluster configuration is:

  • cluster size: 6
  • partitions: 4
  • replication: 3

We don’t know why this is happening (no cause in the gateway or broker logs), and last week we decided to move to a new Zeebe cluster on version 0.26.1, but we are seeing the same behavior.

We have found the environment variable ZEEBE_BROKER_STEPTIMEOUT, which has a default value of 5 minutes, and we don’t know if that is the reason the workers activate 300 seconds later when the unknown error occurs.

Do you have any ideas about this random behavior?

Any idea is greatly appreciated, as doubts about Zeebe are being raised on our side.
Thank you in advance.

Hi @antoniodfr , what’s an MV?

I’m sorry @jwulf, I meant a VM, a Virtual Machine outside the k8s cluster where workers and other microservices are deployed.

1 Like

This could be caused by a worker activating the job, then not completing the job. The broker would then time out the activation, and make it available to another worker.

Could this be the issue? What timeout are the workers specifying when they activate jobs?

Hi @jwulf, we have workers that don’t set a requestTimeout (and I understand that by default this is 20 seconds), and a requestTimeout of 6 seconds on some of them. We always use send().join() when sending commands from the workers.

But we don’t know why we get this random behavior of workflows pausing at a step (it can be at different steps of the same workflow) with the worker not activated; after 300 seconds (and we don’t know why this exact period), the worker is activated and the workflow completes correctly.

Thank you in advance.

I don’t think it is 20 seconds. The timer sweep for job activation timeouts is every 30 seconds. So anything less than that cannot be guaranteed. See here: https://github.com/zeebe-io/zeebe/issues/5073.

You might find it is five minutes. I’d check the Java client source code.

Hi @jwulf, you are correct, I have reviewed Java Client 0.26.1 source code and the default jobTimeout is five minutes.

But my question is: do you know why we are seeing this behavior of a job being created but the corresponding worker not being activated? (We log activation init in the workers, and randomly they don’t get activated, as there are no logs.) After those 5 minutes, the worker is activated and the workflow execution completes OK.

Thank you in advance.

I think what you are describing is “token enters the task, job is activated 5 minutes later by worker”.

You haven’t described it here, but I am imagining that you are looking in Operate and seeing the task activation startDate and endDate, like this:

[Screenshot: Operate showing the task activation startDate and endDate]

Am I correct so far?

If I look at this example from one of my workflows, I am left wondering: is the startDate the token entering this task, or the worker activating the job?

I’m not sure.

But anyway, when I roll my 20-sided Zeebe debugging dice, it says that your worker is not completing the job, and it is timing out and being reactivated.

The worker has a code branch in it that under some combination of circumstances does not throw an unhandled exception, and does not call job.complete(). It’s activating the job, not throwing, and not completing it. The job is then timing out on the broker and being reactivated. The second time round the stars do not align, because it is an edge case. So you do not see it happen twice in a row, and hence a delay of 10 minutes for a job.

Look for that code.
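To make the pattern concrete, here is a hypothetical sketch of the kind of code branch described above. The `FakeJob` type and the `callbackId` field are stand-ins invented for illustration, not the real Zeebe client API; the point is the branch that neither completes nor fails the job.

```java
import java.util.Optional;

public class SilentBranchExample {

    // Minimal stand-in for an activated job and its completion state.
    static class FakeJob {
        final Optional<String> callbackId;
        boolean completed = false;
        boolean failed = false;

        FakeJob(String callbackId) {
            this.callbackId = Optional.ofNullable(callbackId);
        }
    }

    // The buggy handler: on the edge case it just returns, so the job
    // silently times out on the broker and is re-activated much later.
    static void buggyHandler(FakeJob job) {
        if (job.callbackId.isEmpty()) {
            return; // BUG: no job.complete(), no job.fail(), no exception
        }
        job.completed = true; // happy path: would be job.complete() for real
    }

    public static void main(String[] args) {
        FakeJob edgeCase = new FakeJob(null);
        buggyHandler(edgeCase);
        // Neither completed nor failed: the broker re-activates the job only
        // after the activation timeout (5 minutes by default).
        System.out.println(edgeCase.completed || edgeCase.failed); // false
    }
}
```

A worker with a branch like this shows no error anywhere: no exception in the worker logs, no incident in Operate, just a job that reappears after the timeout.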

Exactly @jwulf, that is our case; we are seeing those executions in Operate, like this example:


In this case, the execution stops after 205 seconds, as this workflow is bounded by a timer boundary event of that duration.

We are getting this behavior in different workers of this workflow, and in none of them do we see the first activation trace (we are using Java client workers with Spring Zeebe Starter, deployed on k8s):

@ZeebeWorker(name = "checkDuplicateCallback", type = WORKER_TYPE)
public void checkDuplicateCallback(final JobClient jobClient, final ActivatedJob job) {
    try {
        long jobKey = job.getKey();
        log.info("INIT {}. JobKey: {}", WORKER_TYPE, jobKey);
And we are not seeing any “INIT” execution in logs.

We have migrated our cluster from Zeebe 0.23.7 to Zeebe 0.26.1, and we get this behavior less frequently, but it still occurs.

If you have any idea it will be greatly appreciated!

Thank you in advance.

1 Like

Hi @jwulf, reviewing our Grafana Zeebe monitoring dashboards, we have detected dropped requests in the Command API backpressure panel.

We are using the gradient algorithm, but we don’t see a significant increase in system load at that moment.

If the job is not activating in the worker, then I would suspect the polling of the client.

Intermittent errors are challenging to debug. Trying to get a reliable reproducer is one approach. This exercise forces you to identify the exact set of circumstances under which it happens.

Using open source software, you have to consider the source code as your own source code. You didn’t write the Spring client or the Java client that it wraps, but it is your application code.

So one approach is to reason through the code and look for places where you can instrument it with debugging statements to say things like “I’m polling the gateway for more jobs”, “I got a response back”, etc.

And you need to build a minimum reproducer that does just the things needed to demonstrate the problem. Many times, building this reveals the problem, as you are forced to introduce things one-by-one until you can reproduce.
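As a sketch of that kind of instrumentation (stub types and invented names, not the real Spring Zeebe internals), one can wrap the job handler so that every activation, return, and exception is logged, making a silent branch impossible to miss:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class InstrumentedHandlerSketch {

    // Collected log lines; in a real worker this would be the logger.
    static final List<String> LOG = new ArrayList<>();

    // Wrap a handler so we always see activation, normal return, and any
    // exception, even if the wrapped code has a silent code path inside.
    static Consumer<String> instrument(String workerType, Consumer<String> handler) {
        return jobKey -> {
            LOG.add("ACTIVATED " + workerType + " job=" + jobKey);
            try {
                handler.accept(jobKey);
                LOG.add("HANDLER RETURNED " + workerType + " job=" + jobKey);
            } catch (RuntimeException e) {
                LOG.add("HANDLER THREW " + workerType + " job=" + jobKey + ": " + e);
                throw e;
            }
        };
    }

    public static void main(String[] args) {
        Consumer<String> wrapped =
            instrument("checkDuplicateCallback", key -> { /* real work here */ });
        wrapped.accept("42");
        LOG.forEach(System.out::println);
    }
}
```

If the "ACTIVATED" line never appears for the stalled job, the problem is upstream of the handler, in the client polling or the gateway; if it appears without a matching "RETURNED" or "THREW", the problem is in the handler itself.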

Given your description of it so far:

  • It’s intermittent
  • The broker enters the task
  • The worker does not report INIT of its handler
  • The job is re-activated after the duration of the job activation timeout

It’s most probably in one of these places:

  • Some problem in the broker that does not make the job available for activation (possibly, maybe not likely because not reported by others)
  • Some problem in the Spring client around polling and activation.
  • Some problem in the Java client around polling and activation.

I would change the activation timeout in the worker. If this changes the delay (from 300 seconds to the new value of the activation timeout), then this will demonstrate that the gateway has activated the job and streamed it to the worker, but the worker is either not invoking your handler, or not sending the complete message back to the broker gateway. And then I would inspect these two points in my reproducer.

This is the way.


The symptoms described here look like an issue I’m having, which is already reported here:

It was initially discussed here: Documentation Element (1 or more) and Camunda Extension to "Type" the Documentation


Thanks @marcoplaut, that does look like it could be it. Totally consistent with this behaviour.

@antoniodfr - can you please add a comment to the GitHub issue stating that you are seeing it with the Spring client? That helps with the impact triage and will affect the prioritisation of a fix.

I’ll raise it in the stakeholder input meeting next week, if it hasn’t been fixed before then. It sounds like a timing edge case that isn’t caught in unit tests.

@jwulf what is particularly complicated with that issue is that it’s not always happening and it seems to be (in my case) happening a lot when using a k8s cluster with multiple nodes but is really rare when I run the same test locally on Docker.

If I can help to test a fix, or help to reproduce, do not hesitate to contact me.

Running locally with Docker means a single broker?

In that case it may be a timeout between the gateway and the broker, i.e.: the gateway gets the job from the broker and says “ACTIVATED”, but the requestTimeout to the client expires at exactly that moment, so the gateway does not forward the job to the client.

This issue looks like a probable cause: Make activated jobs which were not send to clients re-activatable · Issue #3631 · camunda-cloud/zeebe · GitHub

Update: this issue also looks like it is the same thing: java client: jobs are getting activated but not coming back to client (intermittent issue) · Issue #3585 · camunda-cloud/zeebe · GitHub

1 Like

I did some digging into the request-response lifecycle. You may be able to work around this (or at least mitigate it) by increasing the request timeout of the ActivateJobs call. See: spring-zeebe/ZeebeWorker.java at master · zeebe-io/spring-zeebe · GitHub

Or by reducing the maxJobsActive, or both.

This seems to occur when the requestTimeout limit is reached right as a broker streams a job to the gateway for the request.

If you tune it so that the gateway always fills the number of jobs for the request before the request times out (by expanding the requestTimeout or reducing the jobs, or both), then this condition can’t occur.

This might reduce / remove the scenario while we’re waiting on the root cause fix.

By default, there is a timeout of 15s between the gateway and each broker node (partition leader) [See here]. So if you have 3 partitions, the worst-case total timeout exposure for a job activation request is 45s across the cluster. So your client request timeout should be 45s plus 10s to be safe.
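The budget above can be written out as a small helper (my own sketch; the 15 s per-partition-leader figure is the default quoted above and may differ in your version):

```java
public class RequestTimeoutBudget {

    // Worst case: the gateway may wait for the broker timeout once per
    // partition leader, so the client requestTimeout should cover all of
    // them plus a safety margin.
    static long recommendedRequestTimeoutSeconds(int partitions,
                                                 long perBrokerTimeoutSeconds,
                                                 long safetyMarginSeconds) {
        return partitions * perBrokerTimeoutSeconds + safetyMarginSeconds;
    }

    public static void main(String[] args) {
        // 3 partitions x 15 s + 10 s margin = 55 s, as in the post above.
        System.out.println(recommendedRequestTimeoutSeconds(3, 15, 10)); // 55
    }
}
```

With the 4 partitions mentioned earlier in this thread, the same formula gives 4 × 15 + 10 = 70 s.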


As @marcoplaut and @antoniodfr described, we are in the same scenario with the Java client (Spring): we are experiencing intermittent execution pauses when multiple pods for the same worker are running in the kube cluster. It is proving really difficult for us to reproduce the problem.

We are going to apply your workaround, reducing the maxJobsActive property in the client, and deploy a workflow in our kube cluster where the workers run as a single pod instance, in order to check if the problems disappear.

Best regards

1 Like


Last Thursday, because of the impact of the problem, we decided to reduce our workers’ deployments in our Kubernetes cluster to a single pod instance. The aim was to check whether the intermittent execution pauses disappear or have a smaller impact on the environment/workflows.

It seems that with this single-worker deployment the problem has disappeared since last Thursday. Maybe the multi-instance worker deployment was the root cause of the problem, due to some race condition between the different worker instances.