Roughly speaking, we receive messages and either start a workflow or update variables on an existing workflow.
We appear to be running out of memory very reliably. Heap size ramps up reasonably quickly and the JVM eventually fails with an OutOfMemoryError (OOME).
Heap dump analysis shows a large number of HashMap$Node instances (a couple of million).
They seem to be frequently associated with VariableStore.removedVariables.
They also seem to be frequently associated with CachedDbEntity instances held by the DbEntityCache.
We were using the embedded H2 DB up until recently and we didn’t appear to be having this problem (or at least it was much slower to materialise).
Since connecting up to a new Oracle DB, this problem has appeared.
We do seem to have a different issue with Oracle, in that the External Task executor is getting “stuck” (either fetching external tasks or marking a task as complete). We are unsure whether the build-up of tasks (tens of thousands) is related to the memory issue.
Does anyone have any ideas what is going on?
Any suggestions welcome.
After further investigation, it appears that the “stuck” task execution is very much related to the problem.
It appears that while a task is being marked as complete (complete()), CommandInvocationContext.queuedInvocations keeps growing: each queued invocation that executes adds more invocations to the queue. As a result, the queue size is going up, not down.
Does anyone know what this means?
It is puzzling that other external tasks are executing fine.
Also, as far as I can tell, this problem never occurred while we were using H2 (though that may be unrelated).
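To make the queue behaviour described above a bit more concrete, here is a toy model in plain Java. It is not Camunda's actual CommandInvocationContext code; the class name, the method names and the `fanOut` parameter are all made up for illustration. It only shows why a queue can never drain when every executed item enqueues more than one follow-up item.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class QueueGrowthDemo {

    /**
     * Processes a work queue for at most maxSteps steps.
     * Every executed item enqueues fanOut follow-up items
     * (fanOut is a hypothetical knob, not an engine setting).
     * Returns the queue size after the last processed step.
     */
    public static int simulate(int fanOut, int maxSteps) {
        Deque<Runnable> queue = new ArrayDeque<>();
        queue.add(followUp(queue, fanOut));
        for (int step = 0; step < maxSteps && !queue.isEmpty(); step++) {
            // Take one item off the queue and run it; it may enqueue more.
            queue.poll().run();
        }
        return queue.size();
    }

    // Builds an item that, when run, enqueues fanOut new items of the same kind.
    private static Runnable followUp(Deque<Runnable> queue, int fanOut) {
        return () -> {
            for (int i = 0; i < fanOut; i++) {
                queue.add(followUp(queue, fanOut));
            }
        };
    }

    public static void main(String[] args) {
        System.out.println("fanOut=0 -> remaining " + simulate(0, 1000));
        System.out.println("fanOut=1 -> remaining " + simulate(1, 1000));
        System.out.println("fanOut=2 -> remaining " + simulate(2, 1000));
    }
}
```

With a fan-out of 0 the queue drains, with 1 it stays at a constant size, and with 2 or more it grows by at least one entry per processed item. If the real queue behaves like the last case, watching its size in a debugger would show the "going up, not down" pattern.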
Is the problem always related to the same external task (same topic)?
Could the external task handler be doing something (calling another service, running DB queries, etc.) that is sometimes very slow, causing the tasks to pile up?
It turns out that if you throw a signal event and have a Signal Boundary event, you will see exactly the behaviour described above. However, it has nothing to do with Oracle.
In its effort to execute the boundary event for all affected process instances, the engine ends up loading many entities into RAM. This takes forever because it loads them one by one, and it eventually runs out of memory, presumably because it is loading variables into the cache as well.
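As a rough back-of-the-envelope model of the effect described above, the sketch below counts simulated queries and cached objects when one signal is delivered to many waiting instances inside a single unit of work. Everything in it is an assumption for illustration: the class and method names are invented, the "one query for the instance plus one for its variables" cost is a guess, and the plain HashMap merely stands in for the engine's entity cache.

```java
import java.util.HashMap;
import java.util.Map;

public class SignalFanOutDemo {

    /**
     * Simulates delivering one signal to `subscribers` waiting instances.
     * Each instance is loaded individually, and (by assumption) its
     * variables are pulled into the same cache, which is never cleared
     * until the whole delivery has finished.
     * Returns {simulated query count, number of cached objects}.
     */
    public static long[] deliverSignal(int subscribers, int variablesPerInstance) {
        Map<String, Object> cache = new HashMap<>();
        long queries = 0;
        for (int i = 0; i < subscribers; i++) {
            // One simulated round trip to load the waiting instance itself.
            cache.put("execution-" + i, new Object());
            queries++;
            // One simulated round trip to load that instance's variables.
            for (int v = 0; v < variablesPerInstance; v++) {
                cache.put("variable-" + i + "-" + v, new Object());
            }
            queries++;
        }
        return new long[] {queries, cache.size()};
    }

    public static void main(String[] args) {
        long[] result = deliverSignal(10_000, 20);
        System.out.println("queries=" + result[0] + ", cached objects=" + result[1]);
    }
}
```

Under these assumptions both the number of round trips and the number of objects held in memory grow linearly with the number of instances that catch the signal, which would be consistent with the slow processing and the heap full of HashMap$Node entries reported earlier in the thread.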