Strategy for scaling Job Workers in Zeebe

As per “Missing metrics related to process processing - #3 by jwulf”, the Pending Jobs metric has been removed from Zeebe.

What is your recommended strategy for scaling Job Worker machines? Ideally, we need to know the backlog in Zeebe so that we can scale our workers to clear it quickly.
Without a clear metric, I am not sure how we can do this.

Hi @guptaashish327 - the current recommendation is to use metrics within the job worker. Specifically, the Java client exposes two metrics (jobs activated and jobs handled), and by subtracting the handled counter from the activated counter, you can derive the count of queued or buffered jobs - jobs which have yet to be handled by the worker. This can help you tune your workers, e.g. scaling in or out, tuning the number of jobs activated, etc.
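To make the subtraction concrete, here is a minimal sketch in Python rather than a definitive implementation. It assumes you have already read the two counter values from wherever your worker exports them; the function names, the `max_jobs_active` parameter, and the 0.8 / 0.2 thresholds are all hypothetical and only for illustration.

```python
def buffered_jobs(activated_total: int, handled_total: int) -> int:
    """Jobs this worker has activated but not yet handled."""
    # Clamp at zero in case the two counters are read at slightly different moments.
    return max(activated_total - handled_total, 0)


def scaling_hint(buffered: int, max_jobs_active: int,
                 high: float = 0.8, low: float = 0.2) -> str:
    """Turn the buffer fill level into a coarse hint (thresholds are assumed)."""
    if max_jobs_active <= 0:
        raise ValueError("max_jobs_active must be positive")
    fill = buffered / max_jobs_active
    if fill >= high:
        return "scale_out"  # buffer is nearly full: worker is falling behind
    if fill <= low:
        return "scale_in"   # buffer is nearly empty: worker has spare capacity
    return "hold"


if __name__ == "__main__":
    buffered = buffered_jobs(activated_total=1250, handled_total=1222)
    print(buffered, scaling_hint(buffered, max_jobs_active=32))
```

Note that this number only covers jobs the worker has already activated, so it reflects the worker's own buffer rather than anything still waiting in the broker.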

Does that help?

@nathan.loding Thanks for your response.
However, the jobs activated and jobs handled metrics will not help us figure out the backlog in Zeebe (jobs that still need to be activated by workers).
As I understand it, “job activated” means a job that has already been activated by a worker for processing.

We were looking for “Jobs which are pending to be picked by workers”.

@guptaashish327 - using those metrics is the current recommendation, but I understand your point. Based on previous questions you’ve asked, it looks like you have an enterprise license; I would recommend reaching out to support if you need a stronger recommendation. In the meantime I can get some feedback from our engineers, but since this is a community forum and not an official support channel, it may take some time before I have any answers.

Hi @nathan.loding Thanks for responding.

Knowing this would help us with:

  1. Having a bird’s eye view of the backlog Zeebe has
  2. Putting a better worker / downstream server scaling strategy in place

Hope you get my point.

Regarding the license, we don’t have an enterprise license as of today; we are still discussing it internally because of the high prices.
Waiting for your answer.

@guptaashish327 - I’ve sent this feedback along to our engineering team. In the meantime, I spoke with one of our consultants, and they provided some additional information that may help:

  1. “We have some metrics exposed via Prometheus that might help. You can see these in the Grafana dashboard. For example, if you look at the Throughput section of a Zeebe Grafana dashboard, there’s a chart for Jobs Creation per second vs Jobs completion per second. So, customers could potentially look at the difference to know when to add more job workers. Job Activation Time and Job Life Time charts under the Latency section could shed some light.”

  2. “If job workers are implemented following our best practices to use non-blocking asynchronous idempotent calls, then a single job worker instance can handle surprisingly high throughput.” - in other words, our experience shows that, quite often, a well-written job worker implementation can scale itself to handle additional throughput (scaling handled via the previously mentioned metrics), rather than needing to add additional job workers to the pool.
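As a rough illustration of the first point, the difference between the job creation rate and the job completion rate can be turned into a backlog growth estimate. This is only a sketch in Python under stated assumptions: the two rates are values you have already read off the dashboard or queried yourself, and `per_worker_per_s` (the throughput one worker instance sustains) is a hypothetical figure you would have to measure for your own workers.

```python
import math


def backlog_growth_per_second(created_per_s: float, completed_per_s: float) -> float:
    """Positive means jobs are created faster than they are completed."""
    return created_per_s - completed_per_s


def extra_workers_needed(created_per_s: float, completed_per_s: float,
                         per_worker_per_s: float) -> int:
    """Estimate how many more worker instances would close the gap."""
    if per_worker_per_s <= 0:
        raise ValueError("per_worker_per_s must be positive")
    gap = backlog_growth_per_second(created_per_s, completed_per_s)
    if gap <= 0:
        return 0  # completions keep up with creations; no extra capacity needed
    return math.ceil(gap / per_worker_per_s)


if __name__ == "__main__":
    # Example figures are made up for illustration only.
    print(extra_workers_needed(created_per_s=150.0, completed_per_s=100.0,
                               per_worker_per_s=20.0))
```

In line with the second point, a positive gap does not necessarily mean more instances are required; tuning an existing worker may close it as well.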

(Also a quick reminder that Camunda requires an enterprise license to run in any production capacity.)

Thank you @nathan.loding.
I will see if we can use the metrics you mentioned.

Regarding licensing, we only use Zeebe 8.5, which is free as per “Licensing Update for Camunda 8 Self-Managed | Camunda”.

We are discussing licensing internally, in case we move to 8.6 or encounter any concerns.
