Follow‑up to “Custom Tasklist Implementation” – Slowness and DEADLINE_EXCEEDED / INVALID_ARGUMENT on user task operations

Hi Camunda team and community,

This is a follow-up to my earlier thread:

In that discussion, the team recommended an event‑driven architecture using User Task Listeners + a custom database layer for implementing a custom tasklist with complex sorting/searching over process variables. We have now implemented that approach and it works functionally, but under load we are seeing slowness and intermittent errors on user-task related operations.

I’d appreciate expert guidance on how to tune and troubleshoot this.

  1. Current setup (recap):
    • Camunda version: 8.8, self-managed (headless)
    • Architecture:
      • We use User Task Listeners to receive user task lifecycle events.
      • A Node.js worker processes these listener jobs and writes user task data + selected process variables into our own database.
      • Our custom tasklist frontend reads exclusively from this local database to:
        • show lists of tasks,
        • sort by business priority and other process variables,
        • perform multi-field search.
    • User operations (write-side):
      • When a user assigns or completes a task in our UI, we call the Orchestration Cluster REST APIs:
        • POST /user-tasks/{userTaskKey}/assignment
        • POST /user-tasks/{userTaskKey}/completion
      • These operations, in turn, trigger the user task listeners we’re consuming.
    • Load characteristics (approximate):
      • ~3,000+ active user tasks at any time
      • ~30+ concurrent users in the custom tasklist
      • We aim for sub-second perceived latency for assign/complete operations.
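
To make the write-side concrete, here is a simplified sketch (not our production code) of an assign call wrapped so that per-operation latency can be recorded. Only the endpoint path comes from the setup above; the base URL, the `assignee` body field, and the `HttpPost` helper type are assumptions for illustration.

```typescript
// Minimal transport abstraction so the sketch stays self-contained.
// In a real service this would wrap fetch or an HTTP client; the shape is hypothetical.
type HttpPost = (url: string, body: unknown) => Promise<{ status: number }>;

interface TimedResult {
  status: number;
  durationMs: number;
}

// Assumed base URL of the Orchestration Cluster REST API (adjust to the deployment).
const BASE_URL = "http://localhost:8080/v2";

// Calls POST /user-tasks/{userTaskKey}/assignment and measures wall-clock latency.
async function assignUserTask(
  post: HttpPost,
  userTaskKey: string,
  assignee: string,
): Promise<TimedResult> {
  const started = Date.now();
  const { status } = await post(
    `${BASE_URL}/user-tasks/${encodeURIComponent(userTaskKey)}/assignment`,
    { assignee }, // body field name is an assumption
  );
  return { status, durationMs: Date.now() - started };
}
```

Logging `durationMs` per call is how we would separate slow assign/complete requests from slow reads against our own database.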

  2. Problem 1 – Slowness and DEADLINE_EXCEEDED / HTTP 504 on assign/complete:
    Under higher load, when users assign or complete tasks via the Orchestration APIs, we sometimes observe:
    • HTTP status: 504
    • Error body:
      • title: “DEADLINE_EXCEEDED”
      • status: 504
      • detail: “Expected to handle request, but request timed out between gateway and broker”

Initially, this was happening often enough to be noticeable in the UI.
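
For context, here is a minimal sketch of how a client could classify this error body and compute backoff delays before retrying. The `ProblemDetail` field names mirror the error body above; the helper names and the delay values are hypothetical, not our actual implementation.

```typescript
// Shape of the error body reported above.
interface ProblemDetail {
  title?: string;
  status?: number;
  detail?: string;
}

// Treat gateway/broker timeouts as transient; everything else is surfaced as-is.
function isTransientTimeout(p: ProblemDetail): boolean {
  return p.status === 504 || p.title === "DEADLINE_EXCEEDED";
}

// Exponential backoff delays: base, 2*base, 4*base, ... capped at maxMs.
function backoffDelays(attempts: number, baseMs: number, maxMs: number): number[] {
  return Array.from({ length: attempts }, (_, i) => Math.min(baseMs * 2 ** i, maxMs));
}
```

One caveat we are unsure about: a request that timed out between gateway and broker may still have been processed, so a blind retry of assign/complete might hit a task whose state has already changed.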

Configuration change we tried:
• In our listener worker we had configured maxParallelJobs to 30.
• We increased maxParallelJobs from 30 to 100.
Effect:
• After increasing maxParallelJobs we no longer see the DEADLINE_EXCEEDED errors.
• However, we still observe slowness in some user task operations:
• Assignment or completion sometimes takes significantly longer than expected.

It’s not clear to us whether we have now simply shifted the bottleneck or whether our current concurrency settings are sub-optimal.
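
One rough way to reason about whether 30 or 100 is the limiting factor is Little's law: jobs in flight ≈ arrival rate × handler time. The sketch below only does that arithmetic; the function names, the headroom factor, and the sample numbers in the usage note are made up for illustration and are not measurements from our system.

```typescript
// Little's law: concurrent jobs needed ≈ jobs per second × seconds per job.
// `headroom` is a hypothetical safety multiplier for bursts.
function requiredParallelJobs(
  jobsPerSecond: number,
  avgHandlerMs: number,
  headroom: number = 1.5,
): number {
  return Math.ceil((jobsPerSecond * avgHandlerMs * headroom) / 1000);
}

// Inverse view: the most jobs per second a worker can drain at a given concurrency.
function maxJobsPerSecond(parallelJobs: number, avgHandlerMs: number): number {
  return (parallelJobs * 1000) / avgHandlerMs;
}
```

For example, with a hypothetical 50 listener jobs per second and a 400 ms handler, `requiredParallelJobs(50, 400)` gives 30, and `maxJobsPerSecond(100, 400)` gives 250. If measured handler time grows under load (for instance because of database writes), the same concurrency drains fewer jobs, which would be consistent with the slowness we still see.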

  3. Problem 2 – “Listener failed: INVALID_ARGUMENT” with no further details:
    More recently, after tuning maxParallelJobs, we began to sometimes see errors of the form:
    • Message: Listener failed: INVALID_ARGUMENT
    • There is no additional detail returned with this error.
    • Retrying the same user task operation (assign/complete) does not help; it fails again. A retry from Operate, however, does work.
    • We also do not see any clear error logs in our current log configuration that explain:
      • what exactly is invalid,
      • which RPC/operation reported INVALID_ARGUMENT, or
      • which payload (variables, headers, etc.) is causing the issue.

Given the gateway/gRPC documentation, we know INVALID_ARGUMENT can be raised, for example, if:
• variable payloads are not valid JSON (or root is not an object),
• some argument combinations are not allowed for a given RPC,
• or there’s some listener-specific validation failing.

However, from the outside we only see a very generic Listener failed: INVALID_ARGUMENT and no concrete reason.
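
As a first check on our side, the JSON-related cause above can be ruled out before variables are sent back when completing a listener job. This is a minimal sketch with a hypothetical helper name; it only covers the "valid JSON with an object root" rule and none of the other possible causes.

```typescript
type ValidationResult = { ok: true } | { ok: false; reason: string };

// Checks that a serialized variables payload is valid JSON and that its root is an object.
function validateVariablesPayload(raw: string): ValidationResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (e) {
    return { ok: false, reason: `not valid JSON: ${(e as Error).message}` };
  }
  // Arrays, null, and primitives are valid JSON but not a JSON object.
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    return { ok: false, reason: "root must be a JSON object" };
  }
  return { ok: true };
}
```

Logging the `reason` together with the user task key in the worker would at least tell us whether the payload we produce is the culprit.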

  4. Questions for the Camunda experts:

  2. Recommended configuration for user task listeners under load:
    • With a custom user task listener worker and 3k+ active tasks / 30+ concurrent users, is increasing maxParallelJobs from 30 to 100 a reasonable approach?
    • Are there recommended or typical ranges for:
      • maxParallelJobs / maxJobsToActivate,
      • request timeouts,
      • and other relevant worker options?
    • Could a too-high maxParallelJobs lead to contention or blocking that manifests as slowness on the Orchestration APIs (assign/complete)?

  3. Understanding and handling DEADLINE_EXCEEDED from Orchestration APIs:
    • From the docs, we understand that 504 / DEADLINE_EXCEEDED can occur when:
      • the gateway to broker communication times out, or
      • user task listeners for that lifecycle event do not complete within the request timeout.
    • In a setup with user task listeners:
      • Is there a recommended timeout setting or pattern for Orchestration API calls that depend on listeners?
      • At what point is it more appropriate to return quickly and poll for completion, versus waiting for the listeners to finish in the same HTTP request?
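
For clarity, the "return quickly and poll" alternative we have in mind looks roughly like the sketch below: the UI fires the request, then polls our local database until the listener worker has recorded the new task state. `pollUntil` and the `check` callback are hypothetical names; the interval and timeout values are placeholders.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Repeatedly calls `check` until it yields a value or the timeout budget is used up.
// Returns undefined when the state was not observed in time.
async function pollUntil<T>(
  check: () => Promise<T | undefined>,
  intervalMs: number,
  timeoutMs: number,
): Promise<T | undefined> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const result = await check();
    if (result !== undefined) return result;
    // Stop if another wait would overshoot the deadline.
    if (Date.now() + intervalMs > deadline) return undefined;
    await sleep(intervalMs);
  }
}
```

In our case `check` would be a lookup of the task row in the local database, returning the state once the listener event for the assignment or completion has been written.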

  4. Diagnosing “Listener failed: INVALID_ARGUMENT”:
    • Where exactly should we expect detailed logs for this kind of error? Is there a specific logger name or category related to task or user task listeners?
    • Is there any way to get more structured error details (e.g. which field was invalid, or the specific gRPC method and reason) when a listener fails with INVALID_ARGUMENT?
    • Are there common causes for INVALID_ARGUMENT in the context of user task listeners that we should double-check (for example, invalid JSON variables in SetVariables, unsupported listener configurations, etc.)?

  5. Best practices for high-throughput user task operations:
    • For this architecture (custom tasklist, local DB, user task listeners + Orchestration APIs), are there any best-practice examples or reference configurations for:
      • listener worker concurrency,
      • gateway/broker timeout settings,
      • and Orchestration API timeouts?
    • Any advice on patterns to avoid that are known to cause slowdowns, e.g. heavy logic inside listeners, large variable payloads on assignment/completion, etc.?

If more information (config snippets, exact error payloads, topology, etc.) would help, I’m happy to share that as well.
Thank you in advance for any guidance.

This is a comprehensive performance tuning question for User Task Listeners under load with specific DEADLINE_EXCEEDED and INVALID_ARGUMENT errors. I found the following relevant resources:

Does this help? If not, can anyone from the community jump in? :waving_hand:



Tuning is as much an art as it is a science.
I would recommend opening a support ticket with Camunda rather than trying to get advice from the other users on the forum here.

Thanks for the input.

I’m aware that proper tuning often needs deep, context-specific analysis, and opening a support ticket is definitely on the table. That said, this is exactly the sort of problem where real-world experience from other users can be extremely valuable as well.
So while I appreciate the “open a ticket” suggestion, I’m still looking for community members who are willing to share what worked (or didn’t) in practice, beyond the generic “contact support.”