Help me understand this OptimisticLockingException

Hi all,

I’m quite new to Camunda and have been learning as I go, but I came across something I couldn’t explain from the documentation, so I’m hoping to get educated :smiling_face:

We are using Camunda 7 as a standalone engine and employing the external task pattern.
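For context, our workers follow the usual external task pattern against the engine’s REST API; here is a simplified Java sketch of what that looks like (the topic and variable names are made up for illustration, not our real ones):

```java
import java.util.Map;

import org.camunda.bpm.client.ExternalTaskClient;

public class RandomNumberWorker {
  public static void main(String[] args) {
    // Long-polling client against the standalone engine's REST API.
    ExternalTaskClient client = ExternalTaskClient.create()
        .baseUrl("http://localhost:8080/engine-rest")
        .asyncResponseTimeout(10_000)
        .build();

    // Task 1: draw a random number and complete the external task with it.
    client.subscribe("task1-get-random-number")
        .lockDuration(10_000)
        .handler((task, service) ->
            service.complete(task, Map.of("value", Math.random())))
        .open();
  }
}
```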

Problem

We were seeing some OptimisticLockingExceptions in production. They were easy to fix with an async after once I found out how they occur and managed to reproduce them locally, but getting there took some time. I still don’t quite understand why they occur, though.

This is probably not the smallest possible example to reproduce the situation, but at least it does the trick, and it resembles our production process flow.

Reproduction

I have two separate processes, which signal each other:

Process #1

  • Task 1: Get a random number
  • If number < 0.5, send signal A, which starts Process #2
  • Else, proceed to wait for either:
    • Signal B from Process #2
    • 5 min timer

Process #2

  • Task 2: Get a random number, but implemented so that the worker holds incoming invocations for a while and then completes them all at the same time (see the sketch after this list)
  • If number < 0.5, send signal B, which continues the waiting process instances in Process #1
  • Else, end process
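Roughly how Task 2’s worker behaves, as a simplified Java sketch (not our real code; the topic name, lock duration and the 5-second batching window are invented for illustration). The point is just that several external task completions hit the engine at the same moment:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.camunda.bpm.client.ExternalTaskClient;

public class BatchedTask2Worker {

  private static final List<Runnable> pendingCompletions = new ArrayList<>();

  public static void main(String[] args) {
    ExternalTaskClient client = ExternalTaskClient.create()
        .baseUrl("http://localhost:8080/engine-rest")
        .asyncResponseTimeout(10_000)
        .build();

    // Task 2: draw the random number right away, but defer the completion call.
    client.subscribe("task2-get-random-number")
        .lockDuration(60_000)
        .handler((task, service) -> {
          double value = Math.random();
          synchronized (pendingCompletions) {
            pendingCompletions.add(() -> service.complete(task, Map.of("value", value)));
          }
        })
        .open();

    // Every 5 seconds, fire all deferred completions concurrently, so several
    // Task 2 instances complete (and continue towards the Signal B throw) at once.
    ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();
    flusher.scheduleAtFixedRate(() -> {
      List<Runnable> batch;
      synchronized (pendingCompletions) {
        batch = new ArrayList<>(pendingCompletions);
        pendingCompletions.clear();
      }
      batch.parallelStream().forEach(Runnable::run);
    }, 5, 5, TimeUnit.SECONDS);
  }
}
```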

EngineError

If I start a few instances of Process #1, I’m getting an EngineError if:

  • At least one process instance takes the else branch and waits at the event based gateway
  • And at least two Process #2 instances complete Task 2 (at the same time) with a value such that signal B would be sent

All instances of Process #2 (beyond the first) trying to send signal B fail to complete Task 2 with this error:
EngineError: Response code 500 (Internal Server Error); Error: ENGINE-03005 Execution of 'DELETE EventSubscriptionEntity[(...)]' failed. Entity was updated by another transaction concurrently.; Type: OptimisticLockingException; Code: 1

Question: Why does this happen?

If I understand this correctly, this behaviour is due to transaction boundaries: when Task 2 is completed, the engine actually commits Task 2 plus everything else up to the next transaction boundary. I guess that boundary lies beyond the signal throw event, so only the first signal is able to correlate to the waiting Process #1 instance(s). Any later concurrent signal fails, and because the whole transaction cannot be committed, completing Task 2 fails as well.

The problem goes away if I add an asynchronous continuation after Task 2: any number of tasks can then complete at the same time. I’m guessing this does not actually get rid of the signal race condition, but that it just gets resolved automatically within the engine?
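For reference, the workaround looks roughly like this if I sketch Process #2 with the BPMN model fluent builder (the names are made up, and the gateway plus the actual Signal B throw event are left out for brevity; the point is only where the save point sits):

```java
import org.camunda.bpm.model.bpmn.Bpmn;
import org.camunda.bpm.model.bpmn.BpmnModelInstance;

public class Process2Sketch {
  public static BpmnModelInstance build() {
    return Bpmn.createExecutableProcess("process2")
        .startEvent()
        .serviceTask("task2")
          .camundaType("external")
          .camundaTopic("task2-get-random-number")
          // Save point: completing the external task commits here, and everything
          // downstream (gateway, Signal B throw event) runs in a separate job.
          .camundaAsyncAfter()
        // ... exclusive gateway and the Signal B intermediate throw event go here ...
        .endEvent()
        .done();
  }
}
```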

What I don’t understand is that the documentation on Transactions in Processes lists a Signal Event as an automatic wait state. I thought that meant the element has an implicit “async before”, and therefore the transaction started by completing Task 2 should not extend beyond the signal throw and could be committed safely any number of times?

But obviously my mental model is not quite right :grin:

Where is my thinking going wrong, and am I understanding the situation at all correctly?

Thanks in advance! :smiling_face:

Hi Zebranky,
Let me try to clarify.
Your understanding is correct that a signal event will persist the state by default, but there is a catch:

  • A signal throw event persists the state only after it completes its execution
  • A signal receive event persists the state right after the event subscription is created (i.e., as soon as the execution reaches that event)

In your case, if you use an async save point at Task 2, this issue will not occur because of the job executor: separate threads pick up each job, which introduces some delay and helps each signal event complete without issue.

Thanks,
Ranga

This adds a transaction boundary after Task 2.
A transaction boundary commits everything that has happened since the PRIOR boundary.

Signals are not a good way of modeling the processes, since they are a “Shout down the hall” rather than a “Phone call to the person that needs to know”. The way that it’s modeled, you actually have 3 Processes happening:

  1. Process Type 1 - MANUAL START Get Random, Random < 0.5. Starts Process 3 (by shouting in “Hallway 1”)
  2. Process Type 1 - MANUAL START Get Random, Random >= 0.5. Waits for any shouts in “Hallway 2”.
  3. Process Type 2 - Listens for “Shouts”, gets a new Random. If < 0.5, then it shouts in “Hallway 2”

If you have more than 2 processes that are listening in Hallway 2, all of them will continue on. That’s probably not what you want to do.
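To make the distinction concrete, this is roughly what the two look like through the Camunda 7 Java API (the signal/message names and the business key are invented for the example; against a standalone engine you would do the equivalent via REST):

```java
import org.camunda.bpm.engine.ProcessEngine;
import org.camunda.bpm.engine.ProcessEngines;
import org.camunda.bpm.engine.RuntimeService;

public class ShoutVsPhoneCall {
  public static void main(String[] args) {
    ProcessEngine engine = ProcessEngines.getDefaultProcessEngine();
    RuntimeService runtimeService = engine.getRuntimeService();

    // "Shout down the hall": every instance currently subscribed to "SignalB"
    // (everyone listening in Hallway 2) is triggered by this single call.
    runtimeService.signalEventReceived("SignalB");

    // "Phone call to the person that needs to know": a message is correlated
    // to exactly one waiting instance, picked here by business key.
    runtimeService.createMessageCorrelation("ContinueWaiting")
        .processInstanceBusinessKey("order-4711")
        .correlate();
  }
}
```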

The reason you’re getting OLEs is because:
PID1 - Random > 0.5 (Wait for Shout)
PID2 - Random > 0.5 (Wait for Shout)
PID3 - Random > 0.5 (Wait for Shout)
PID4 - Random < 0.5 (Shout at Type2)
PID5 - Random < 0.5 (Shout at Type2)
PID6 - Type 2 - Shout back. (Updates PID1, PID2, and PID3)
PID7 - Type 2 - Shout back. (Attempts to update PID1, PID2, and PID3)

Since PID6 and PID7 both attempt to update PID1, PID2, and PID3 at the same time, you get a Lock Exception.

Thanks for the information! And sorry about the slow response, I left for vacation before my message was published :grinning:

In your case, if you use an async save point at Task 2, this issue will not occur because of the job executor: separate threads pick up each job, which introduces some delay and helps each signal event complete without issue.

Is it really just about the delay? If it were a random delay, I would imagine it could still occasionally fail if two threads tried to complete the signal at the same time? I thought the internal job executor could resolve such conflicts automatically? I mean, the error might still happen when executing the signal throw, but I would just never know about it, because my external task has already completed.
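(For what it’s worth, I suppose I would at least find out if the engine eventually gave up on such a job, since it should then show up as a failed job with an exception; something like this, if I understand the Java API correctly:)

```java
import org.camunda.bpm.engine.ManagementService;
import org.camunda.bpm.engine.ProcessEngine;
import org.camunda.bpm.engine.ProcessEngines;
import org.camunda.bpm.engine.runtime.Job;

public class FindFailedJobs {
  public static void main(String[] args) {
    ProcessEngine engine = ProcessEngines.getDefaultProcessEngine();
    ManagementService managementService = engine.getManagementService();

    // Jobs whose last execution ended with an exception, e.g. an OptimisticLockingException.
    for (Job job : managementService.createJobQuery().withException().list()) {
      System.out.println("Job " + job.getId() + " has " + job.getRetries() + " retries left");
      System.out.println(managementService.getJobExceptionStacktrace(job.getId()));
    }
  }
}
```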

A signal throw event persists the state only after it completes its execution

Also, I still don’t quite understand why my original process would fail, if this is true. When I complete my external task, the process should then progress until the signal throw, which would get persisted (N times)? But is it illegal to try to send the same signal twice at the same time? I thought my EngineError occurred somewhere on the receiving end (signal catch).

Can you recommend good sources of information for when and how processes get persisted and what kind of errors can occur there? Should I just start reading through the source code? :grin:

Thanks again!

Thanks for the response!

The “shout down the hall” is actually what we want to do. There is something of a lobby where all processes wait until another process gives them a shout. And sometimes two (or more) processes can shout at the same time.

What I fail to understand is why the transaction mechanism doesn’t persist processes PID4 and PID5 before or after they have shouted. I thought there should be an automatic transaction boundary before a signal throw (per this doc). If there were, my external task should complete without failure, and then the internal job executor could deal with the OptimisticLockingException that occurs during PID6 and PID7.

Why does the problem occur already in the same transaction where I complete my external task?

And this is what’s giving you the optimistic lock exception.

I would argue that the documentation is incomplete.
A Signal Receive is always a wait state. A Signal Throw is not.

Thanks again! Only now did I think to look further into what an EventSubscriptionEntity actually is. Would you say my understanding is correct if I picture the technical sequence of events as follows:

  1. Process A, which waits for a signal, persists an EventSubscriptionEntity
  2. Process B1, which throws a signal:
    • Finds the EventSubscriptionEntity (or entities)
    • Moves process A beyond the waiting step (?)
    • Deletes the EventSubscriptionEntity
  3. Process B2 would want to do the same as B1, but the EventSubscriptionEntity has already been deleted, which leads to a conflict → EngineError

And the internal job executor could probably resolve this itself, but in this case the conflict occurs in the same transaction as my external task.
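To check my mental model, this is roughly how I picture the engine state around steps 1–3, expressed with the Java query API (the signal name is made up here, and we actually talk to the engine over REST; the subscriptions correspond to rows in ACT_RU_EVENT_SUBSCR, which is what the EventSubscriptionEntity in my error refers to):

```java
import java.util.List;

import org.camunda.bpm.engine.ProcessEngine;
import org.camunda.bpm.engine.ProcessEngines;
import org.camunda.bpm.engine.RuntimeService;
import org.camunda.bpm.engine.runtime.EventSubscription;

public class InspectSignalSubscriptions {
  public static void main(String[] args) {
    ProcessEngine engine = ProcessEngines.getDefaultProcessEngine();
    RuntimeService runtimeService = engine.getRuntimeService();

    // Step 1: each Process A instance waiting at the signal catch event owns one subscription.
    List<EventSubscription> subscriptions = runtimeService.createEventSubscriptionQuery()
        .eventType("signal")
        .eventName("SignalB")
        .list();
    subscriptions.forEach(s ->
        System.out.println(s.getProcessInstanceId() + " is waiting for " + s.getEventName()));

    // Step 2: delivering the signal (here via the API rather than a throw event)
    // advances those instances and deletes their subscriptions in the same transaction.
    runtimeService.signalEventReceived("SignalB");
  }
}
```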

I just verified that I can indeed see OptimisticLockingExceptions in the Camunda engine logs, even after defining a proper async boundary before the signal throw event, which is how I expected it to be.

I have a feeling I understand Camunda better now, but do correct me if it seems I’m still off the mark here :smiling_face: