We have a failed job incident handler that will create a new process instance when a particular failure occurs. The process instance shares the same business key, but otherwise exists on its own and is its own root process instance.
We are seeing odd behavior: the process instance is created with createProcessInstanceByKey and has an ID, but it is never retrievable via the RuntimeService or the HistoryService. It seems never to have been created in the first place. Looking at the Camunda logs, we can see that the INSERT command is created for the process instance, but it is never committed.
This happens consistently when multiple incident handlers execute in parallel and one of the other handlers fails. If there is only a single incident, the process instance gets created as expected.
Are there any particular known scenarios that can cause a process instance created by createProcessInstanceByKey to not fully complete its creation?
I think you’ve answered your own question. All the actions of the incident handlers take place within one transaction. If a handler fails, the transaction is rolled back, undoing everything that was performed within it. If you install just one incident handler and call the other handlers from it (a composite), you will have more control over when a failure gets through to the engine and causes the transaction to roll back.
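To illustrate the composite idea, here is a minimal sketch in plain Java. It deliberately does not use Camunda's own IncidentHandler interface: FailureHandler, CompositeFailureHandler and suppressedFailures are hypothetical names made up for this example, and the real handler signature depends on your Camunda version. The point is only the shape: one registered handler calls the others and decides which exceptions are allowed to propagate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an incident handler; NOT Camunda's IncidentHandler interface.
interface FailureHandler {
    void handle(String jobId) throws Exception;
}

// Composite: the only handler that would be registered with the engine.
// It invokes the delegates itself and decides whether a delegate's failure
// is rethrown (which would let the surrounding transaction roll back) or recorded.
final class CompositeFailureHandler implements FailureHandler {
    private final List<FailureHandler> delegates;
    private final List<String> suppressed = new ArrayList<>();

    CompositeFailureHandler(List<FailureHandler> delegates) {
        this.delegates = List.copyOf(delegates);
    }

    @Override
    public void handle(String jobId) {
        for (FailureHandler delegate : delegates) {
            try {
                delegate.handle(jobId);
            } catch (Exception e) {
                // Assumption for this sketch: record the failure instead of rethrowing,
                // so work done by earlier delegates is not undone by a later one.
                suppressed.add(jobId + ": " + e.getMessage());
            }
        }
    }

    // Failures that were caught rather than propagated, for logging or later handling.
    List<String> suppressedFailures() {
        return List.copyOf(suppressed);
    }
}
```

Whether swallowing a delegate's exception is acceptable depends on your process. If some failures must still roll everything back, the catch block is where you would rethrow them selectively.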
Thanks for your reply! The other handlers fail because they are expecting the first process instance (created by the first incident handler) to already exist.
Once we realized the issue was that the transaction wasn’t completing, we were able to update our logic to ensure that the process instance creation was in its own completed transaction.
Do you know what determines whether multiple incident handlers share the same transaction? Is it any incidents that happen at approximately the same time? (For context, our scenario was parallel call activities, each with its own UserTask which could raise the error.)
IMO an execution of an incident handler (and hence its transaction) roughly corresponds to a failed job. I.e., if you have parallel activities, you’ll get two separate transactions for the incident handlers, regardless of how close in time the incidents occurred.
This is how I understand it; I’m not an expert in this area though.