Hi Camunda team and community,
This is a follow-up to my earlier thread:
In that discussion, the team recommended an event-driven architecture using User Task Listeners plus a custom database layer to implement a custom tasklist with complex sorting and searching over process variables. We have now implemented that approach and it works functionally, but under load we are seeing slowness and intermittent errors on user-task-related operations.
I’d appreciate expert guidance on how to tune and troubleshoot this.
-
Current setup (recap):
• Camunda version: 8.8, self-managed (headless)
• Architecture:
• We use User Task Listeners to receive user task lifecycle events.
• A Node.js worker processes these listener jobs and writes user task data + selected process variables into our own database.
• Our custom tasklist frontend reads exclusively from this local database to:
• show lists of tasks,
• sort by business priority and other process variables,
• perform multi-field search.
• User operations (write-side):
• When a user assigns or completes a task in our UI, we call the Orchestration Cluster REST APIs:
• POST /user-tasks/{userTaskKey}/assignment
• POST /user-tasks/{userTaskKey}/completion
• These operations, in turn, trigger the user task listeners we’re consuming.
• Load characteristics (approximate):
• ~3,000+ active user tasks at any time
• ~30+ concurrent users in the custom tasklist
• We aim for sub-second perceived latency for assign/complete operations.
-
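To make the write-side concrete, here is a minimal TypeScript sketch of how one of the two requests above could be assembled. It is an illustration, not our production code: the helper name `prepareUserTaskRequest`, the base URL (assumed to already include the API version prefix), the bearer token, and the `assignee` payload field are placeholders or assumptions.

```typescript
// Sketch only: helper name, base URL, token, and payload fields are assumptions.
type UserTaskOperation = "assignment" | "completion";

interface PreparedRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Build the request for one of the two user task endpoints listed above.
function prepareUserTaskRequest(
  baseUrl: string,
  userTaskKey: string,
  operation: UserTaskOperation,
  payload: Record<string, unknown>,
  token: string,
): PreparedRequest {
  // Drop trailing slashes so the joined path never contains "//".
  const trimmed = baseUrl.replace(/\/+$/, "");
  return {
    url: `${trimmed}/user-tasks/${encodeURIComponent(userTaskKey)}/${operation}`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${token}`,
      },
      body: JSON.stringify(payload),
    },
  };
}
```

The result can be handed to any HTTP client, for example `fetch(req.url, req.init)`, which keeps the request shape easy to log when a call is slow or fails.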
Problem 1 – Slowness and DEADLINE_EXCEEDED / HTTP 504 on assign/complete:
Under higher load, when users assign or complete tasks via the Orchestration APIs, we sometimes observe:
• HTTP status: 504
• Error body:
• title: “DEADLINE_EXCEEDED”
• status: 504
• detail: “Expected to handle request, but request timed out between gateway and broker”
Initially, this was happening often enough to be noticeable in the UI.
Configuration change we tried:
• In our listener worker we had configured maxParallelJobs to 30.
• We increased maxParallelJobs from 30 to 100.
Effect:
• After increasing maxParallelJobs we no longer see the DEADLINE_EXCEEDED errors.
• However, we still observe slowness in some user task operations:
• Assignment or completion sometimes takes significantly longer than expected.
It’s not clear to us whether we have now simply shifted the bottleneck or whether our current concurrency settings are sub-optimal.
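One rough way we could sanity-check whether 30 or 100 is in the right range is Little's law: the number of jobs in flight is approximately the arrival rate multiplied by the time each job spends in the handler. The sketch below is only that estimate; the function name `estimateParallelJobs`, the headroom factor, and the example figures are hypothetical, not measurements from our system.

```typescript
// Little's law: jobs in flight ≈ arrival rate × time spent per job.
// All inputs are estimates the caller must supply; nothing here is measured.
function estimateParallelJobs(
  eventsPerSecond: number, // listener jobs created per second at peak
  avgHandlerMs: number,    // average time the worker holds a job (DB write included)
  headroom = 1.5,          // safety factor for bursts (assumed value)
): number {
  if (eventsPerSecond < 0 || avgHandlerMs < 0 || headroom < 1) {
    throw new RangeError("inputs must be non-negative and headroom must be >= 1");
  }
  // Multiply before dividing to keep the arithmetic exact for typical integer inputs.
  const inFlight = (eventsPerSecond * avgHandlerMs * headroom) / 1000;
  return Math.max(1, Math.ceil(inFlight));
}

// Hypothetical example: 20 listener events/s, 500 ms per job, 1.5x headroom.
const suggested = estimateParallelJobs(20, 500);
console.log(suggested);
```

If the measured handler time turns out to be dominated by our own database write, that would suggest the worker rather than the broker is the limiting factor, which is part of what we are trying to confirm.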
-
Problem 2 – “Listener failed: INVALID_ARGUMENT” with no further details:
More recently, after tuning maxParallelJobs, we began to see intermittent errors of the form:
• Message: Listener failed: INVALID_ARGUMENT
• There is no additional detail returned with this error.
• Retrying the same user task operation (assign/complete) does not help: it fails again. (Retrying from Operate does work.)
• We also do not see any clear error logs in our current log configuration that explain:
• what exactly is invalid,
• which RPC/operation reported INVALID_ARGUMENT,
• or which payload (variables, headers, etc.) is causing the issue.
Given the gateway/gRPC documentation, we know INVALID_ARGUMENT can be raised, for example, if:
• variable payloads are not valid JSON (or root is not an object),
• some argument combinations are not allowed for a given RPC,
• or there’s some listener-specific validation failing.
However, from the outside we only see a very generic Listener failed: INVALID_ARGUMENT and no concrete reason.
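To rule out the first cause on our side, we could validate variable payloads before the worker sends them. The sketch below checks only the two conditions named above (valid JSON, object at the root); the function name `checkVariablesPayload` is hypothetical, and this does not reproduce whatever additional validation the gateway or broker performs.

```typescript
// Result of a local pre-flight check on a serialized variables payload.
type PayloadCheck = { ok: true } | { ok: false; reason: string };

// Checks that the payload is valid JSON and that its root is an object
// (not an array, string, number, boolean, or null).
function checkVariablesPayload(raw: string): PayloadCheck {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    return { ok: false, reason: `not valid JSON: ${(err as Error).message}` };
  }
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    return { ok: false, reason: "JSON root must be an object" };
  }
  return { ok: true };
}

// Example: an array at the root is rejected before it reaches the cluster.
const check = checkVariablesPayload('["priority", "high"]');
if (!check.ok) {
  console.log(`skipping send: ${check.reason}`);
}
```

Logging the rejected payload together with the user task key at this point would at least tell us whether the failing operations share a malformed variable.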
-
Questions for the Camunda experts:
-
Recommended configuration for user task listeners under load:
• With a custom user task listener worker and 3k+ active tasks / 30+ concurrent users, is increasing maxParallelJobs from 30 to 100 a reasonable approach?
• Are there recommended or typical ranges for:
• maxParallelJobs / maxJobsToActivate,
• request timeouts,
• and other relevant worker options?
• Could setting maxParallelJobs too high lead to contention or blocking that manifests as slowness on the Orchestration APIs (assign/complete)?
-
Understanding and handling DEADLINE_EXCEEDED from Orchestration APIs:
• From the docs, we understand that 504 / DEADLINE_EXCEEDED can occur when:
• the gateway to broker communication times out, or
• user task listeners for that lifecycle event do not complete within the request timeout.
• In a setup with user task listeners:
• Is there a recommended timeout setting or pattern for Orchestration API calls that depend on listeners?
• At what point is it more appropriate to return quickly and poll for completion, versus waiting for the listeners to finish in the same HTTP request?
-
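To make the "return quickly and poll" option concrete, this is roughly the client-side decision logic we have in mind. It is a sketch under assumptions: we assume a 504 means the outcome is unknown (the command may still have been applied), so the UI should check our local database before retrying, and treating 503 as retryable is also an assumption. The names `nextStepForStatus` and `backoffDelaysMs` are hypothetical.

```typescript
type NextStep = "done" | "verify-then-retry" | "retry-later" | "give-up";

// Decide what the UI should do after an assign/complete call returns.
function nextStepForStatus(status: number): NextStep {
  if (status >= 200 && status < 300) return "done";
  // Assumption: on 504 the command may or may not have been applied,
  // so check the task state in the local DB before sending it again.
  if (status === 504) return "verify-then-retry";
  // Assumption: 503 signals temporary overload and is safe to retry later.
  if (status === 503) return "retry-later";
  // Other errors (e.g. validation or state conflicts) are not retried blindly.
  return "give-up";
}

// Exponential backoff schedule for polling or retrying, capped at capMs.
function backoffDelaysMs(attempts: number, baseMs: number, capMs: number): number[] {
  return Array.from({ length: attempts }, (_, i) => Math.min(capMs, baseMs * 2 ** i));
}

// Example schedule: four attempts starting at 250 ms, capped at 1 s.
console.log(backoffDelaysMs(4, 250, 1000));
```

Whether this is preferable to simply raising the request timeout is exactly what we would like guidance on.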
Diagnosing “Listener failed: INVALID_ARGUMENT”:
• Where exactly should we expect detailed logs for this kind of error? Is there a specific logger name or category related to user task listeners?
• Is there any way to get more structured error details (e.g. which field was invalid, or the specific gRPC method and reason) when a listener fails with INVALID_ARGUMENT?
• Are there common causes for INVALID_ARGUMENT in the context of user task listeners that we should double-check (for example, invalid JSON variables in SetVariables, unsupported listener configurations, etc.)?
-
Best practices for high-throughput user task operations:
• For this architecture (custom tasklist, local DB, user task listeners + Orchestration APIs):
• Are there any best-practice examples or reference configurations for:
• listener worker concurrency,
• gateway/broker timeout settings,
• and Orchestration API timeouts?
• Any advice on patterns to avoid that are known to cause slowdowns, e.g. heavy logic inside listeners, large variable payloads on assignment/completion, etc.?
If more information (config snippets, exact error payloads, topology, etc.) would help, I’m happy to share that as well.
Thank you in advance for any guidance.