Reading the metrics docs: the job-acquired-failure metric is described as:
The number of jobs that were acquired but could not be locked for execution due to another job executor locking/executing the jobs in parallel.
Is there further documentation somewhere that explains how to read this metric as an actionable indicator? Is a large job-acquired-failure count expected under normal operating circumstances? What is expected vs error-like behaviour?
@Niall any insights on this?
My understanding of this metric is that job acquisition is based on two queries:
The first is a select which identifies a set of candidate jobs ready to be acquired.
The second is an update which sets the lock on the targeted set of jobs to be acquired.
If the number of rows updated by the second query is less than the number of candidate rows, the difference is the count recorded against job-acquired-failures. That being the case, I would anticipate this only happens in a cluster, or during failover from one node to another. If I get time, I’ll try to identify the job executor code to confirm or deny…
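To make the mechanism concrete, here is a minimal, hypothetical simulation of that two-step acquisition, not the actual engine code. The class and method names are mine, and `putIfAbsent` on an in-memory map stands in for the conditional UPDATE that sets the lock column only where it is still unset:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: two nodes SELECT the same candidate jobs, then race to lock them.
// Jobs another node locked in between count toward job-acquired-failure.
public class JobAcquisitionSketch {

    // jobId -> owning node; absence of a key means the job is unlocked
    static final Map<String, String> locks = new ConcurrentHashMap<>();

    // Attempts to lock each candidate for `node`; returns how many could
    // NOT be locked -- the delta reported as job-acquired-failure.
    static int acquire(String node, List<String> candidates) {
        int failures = 0;
        for (String jobId : candidates) {
            // putIfAbsent succeeds only if no other node locked the job
            // first, mirroring "UPDATE ... WHERE lock_owner IS NULL".
            if (locks.putIfAbsent(jobId, node) != null) {
                failures++; // lost the race to another executor
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("job1", "job2", "job3");
        // nodeA locks all three candidates first
        System.out.println("nodeA failures: " + acquire("nodeA", candidates));
        // nodeB selected the same candidates, so every lock attempt fails
        System.out.println("nodeB failures: " + acquire("nodeB", candidates));
    }
}
```

On a single node the second acquisition never happens concurrently, which is why the metric should sit near zero outside a cluster.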
Agreed, that’s what it appears to be. But this means the larger the cluster, the more failures will be recorded. So my question is: what is the purpose of that metric? What does a low or high number indicate? How are you supposed to use it?
I would interpret a high number as a sign of inefficiency. Rationale: the job executors are potentially tripping over each other, so the DB and the executors may end up wrestling with each other and thrashing.
This blog post talks to this a little. Hence I would use this metric to adjust backoff rates, job acquisition size, etc. until it is minimized…
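As an illustration of the kind of tuning meant above, a job-acquisition fragment along these lines could be adjusted per node (property names follow Camunda's documented job executor settings, but treat the specific values as placeholder assumptions, not recommendations):

```xml
<job-executor>
  <job-acquisition name="default">
    <properties>
      <!-- fewer jobs per acquisition cycle reduces the overlap between
           nodes selecting the same candidates -->
      <property name="maxJobsPerAcquisition">3</property>
      <!-- base idle wait between acquisition cycles -->
      <property name="waitTimeInMillis">5000</property>
      <!-- back off further after unsuccessful acquisitions, so competing
           nodes desynchronize instead of repeatedly colliding -->
      <property name="backoffTimeInMillis">50</property>
      <property name="maxBackoff">10000</property>
    </properties>
  </job-acquisition>
</job-executor>
```

The idea is to nudge these values while watching job-acquired-failure: if the count drops without throughput suffering, the executors are colliding less.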