FetchAndLock long polling response times are inconsistent with multiple instances of Camunda Engine

We are running 2 instances of the Camunda Process Engine (configured with SQL Server) on a Kubernetes cluster with load balancing.
For each fetchAndLock request to the engine, we use long polling with an asyncResponseTimeout of 2 minutes.
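
For illustration, a minimal worker setup with this timeout might look like the following sketch, assuming the official Camunda Java external task client (the base URL, topic name and lock duration are just placeholders, not our exact values):

```java
import org.camunda.bpm.client.ExternalTaskClient;

public class LongPollingWorker {

    public static void main(String[] args) {
        // Long polling: fetchAndLock stays open for up to 2 minutes
        // when no external tasks are available (asyncResponseTimeout is in ms).
        ExternalTaskClient client = ExternalTaskClient.create()
                .baseUrl("http://camunda-service/engine-rest") // behind the Kubernetes load balancer
                .asyncResponseTimeout(120_000)
                .build();

        client.subscribe("my-topic")        // placeholder topic name
                .lockDuration(60_000)       // placeholder lock duration
                .handler((externalTask, externalTaskService) -> {
                    // ... process the task ...
                    externalTaskService.complete(externalTask);
                })
                .open();
    }
}
```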

We have noticed that, randomly but quite frequently, a fetchAndLock request does not return immediately when new external tasks become available in Camunda. Instead, it returns those new external tasks only when the 2-minute timeout for that particular request expires.

When there is only a single instance of the process engine, we do not see this behavior. It only happens when there is more than one instance.

A sample scenario of the problem (see the sketch after the list):

  1. Client requests FetchAndLock using long polling.
  2. Client receives external task immediately.
  3. Client delegates worker to process task.
  4. Client sends another fetchAndLock request using long polling (the request is kept open).
  5. After a few seconds, the worker finishes the external task and sends the completion to Camunda.
  6. The request made in step (4) does not receive a response until its 2-minute timeout expires (even though the completion was sent much earlier and the new external task was available).
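
To make steps (4) and (5) concrete, the sketch below shows them as two plain REST calls against the engine's REST API (base URL, worker id, topic name and task id are placeholders). The key point is that they are independent HTTP requests, so the load balancer is free to route each one to a different engine instance:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FetchAndCompleteSketch {

    static final HttpClient HTTP = HttpClient.newHttpClient();
    static final String BASE = "http://camunda-service/engine-rest"; // behind the load balancer

    // Step (4): long-polling fetchAndLock, kept open for up to 2 minutes.
    static HttpResponse<String> fetchAndLock() throws Exception {
        String body = "{ \"workerId\": \"worker-1\", \"maxTasks\": 1, \"asyncResponseTimeout\": 120000,"
                + " \"topics\": [{ \"topicName\": \"my-topic\", \"lockDuration\": 60000 }] }";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(BASE + "/external-task/fetchAndLock"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        return HTTP.send(request, HttpResponse.BodyHandlers.ofString());
    }

    // Step (5): completion of a previously fetched task; a separate HTTP request,
    // which the load balancer may send to the other engine instance.
    static HttpResponse<String> complete(String taskId) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(BASE + "/external-task/" + taskId + "/complete"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{ \"workerId\": \"worker-1\" }"))
                .build();
        return HTTP.send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```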

My guess is that the long-polling request from step (4) is handled by Camunda engine instance 1, while the completion from step (5) is routed by the load balancer to Camunda engine instance 2. This possibly causes the event-driven long polling mechanism on instance 1 to not return immediately. I may well be wrong with my guess, but some help in this matter would be greatly appreciated, as our system is now having severe performance issues since switching to long polling.

Hi @vigb,

your guess makes sense to me. Maybe you can return information about the node that handled the fetch-and-lock call and include it in the complete request, so that the load balancer routes both calls to the same node.
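
A very rough sketch of that idea (the "X-Engine-Node" header is hypothetical, Camunda itself does not send such a header; your ingress/load balancer would have to add it on the response and route on it for the complete call):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StickyCompleteSketch {

    static final HttpClient HTTP = HttpClient.newHttpClient();
    static final String BASE = "http://camunda-service/engine-rest"; // behind the load balancer

    // Complete the task on the same node that answered the earlier fetchAndLock call.
    // fetchResponse is the response of that fetchAndLock request.
    static void completeOnSameNode(HttpResponse<String> fetchResponse, String taskId) throws Exception {
        // Hypothetical header identifying the engine node that handled the fetch.
        String node = fetchResponse.headers().firstValue("X-Engine-Node").orElse("");

        HttpRequest complete = HttpRequest.newBuilder()
                .uri(URI.create(BASE + "/external-task/" + taskId + "/complete"))
                .header("Content-Type", "application/json")
                .header("X-Engine-Node", node) // affinity hint for the load balancer
                .POST(HttpRequest.BodyPublishers.ofString("{ \"workerId\": \"worker-1\" }"))
                .build();

        HTTP.send(complete, HttpResponse.BodyHandlers.ofString());
    }
}
```

Configuring session affinity (sticky sessions) on the load balancer for the worker's connections would be another way to get a similar effect.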

Hope this helps, Ingo