Jobs are not being executed (or with heavy delay)

We’re using the Camunda 7.4 Docker image (i.e. the Wildfly application server).
We’ve got 3 or 4 process definitions running at once (some of them with 30 or more instances), which usually work, but one or two of them seem to be stuck for no apparent reason. According to our logs (each service task writes a log message before doing anything else), those jobs sometimes don’t seem to start at all, which results in blue dots attached to those jobs that don’t go anywhere for quite a while.
We managed to force them onward with a REST command, but that is not a suitable solution.
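For reference, forcing a job onward boils down to executing it by id, either via POST /job/{id}/execute in the REST API or via the equivalent Java API call. A minimal sketch, with a hypothetical process instance id:

    import org.camunda.bpm.engine.ManagementService;
    import org.camunda.bpm.engine.ProcessEngine;
    import org.camunda.bpm.engine.ProcessEngines;
    import org.camunda.bpm.engine.runtime.Job;

    public class ForceStuckJobs {
      public static void main(String[] args) {
        ProcessEngine engine = ProcessEngines.getDefaultProcessEngine();
        ManagementService managementService = engine.getManagementService();

        // find the executable jobs of a stuck instance ("aProcessInstanceId" is hypothetical)
        for (Job job : managementService.createJobQuery()
            .processInstanceId("aProcessInstanceId")
            .executable()
            .list()) {
          // runs the job synchronously in the calling thread
          managementService.executeJob(job.getId());
        }
      }
    }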

Any ideas on how we can solve this issue?

Just found out that the problem seems to be related to long-running service tasks. Some of our service tasks take longer than 5 minutes, so the Wildfly transaction timeout triggers, which shows up in the Wildfly log as follows:

17:33:57,709 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff0a000158:-618b351a:5718efa0:4cd in state  RUN
17:33:57,709 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper Worker 0) ARJUNA012095: Abort of action id 0:ffff0a000158:-618b351a:5718efa0:4cd invoked while multiple threads active within it.
17:33:57,709 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper Worker 0) ARJUNA012108: CheckedAction::check - atomic action 0:ffff0a000158:-618b351a:5718efa0:4cd aborting with 1 threads active!
17:33:57,719 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper Worker 0) ARJUNA012121: TransactionReaper::doCancellations worker Thread[Transaction Reaper Worker 0,5,main] successfully canceled TX 0:f

Afterwards, the whole process engine / job executor seems to be confused.
Has anybody had similar experiences?

Hi @Sascha_Karcher,

Just a guess:

Assuming these jobs always take more than 5 minutes, the following problem may occur. The job executor ensures that a job is only executed once by setting a time-based lock on it. By default, this lock time is 5 minutes. If the lock expires, the job may be acquired and executed again. Now if a job always takes longer to execute than the lock expiration time, it is going to be acquired again and again. Every time a job execution thread finishes, it fails with an optimistic locking exception, because the job has been acquired again in the meantime. Perhaps that is the reason why the job executor is making almost no progress.
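To illustrate the failure mode (a contrived sketch, not taken from your code): any delegate whose work exceeds the default 5-minute lock reproduces this pattern:

    import org.camunda.bpm.engine.delegate.DelegateExecution;
    import org.camunda.bpm.engine.delegate.JavaDelegate;

    // Contrived example: work that exceeds the default lockTimeInMillis (300000 ms)
    public class SlowServiceTask implements JavaDelegate {
      @Override
      public void execute(DelegateExecution execution) throws Exception {
        // While this thread is busy, the job lock expires, the job executor
        // acquires the same job a second time, and the first thread fails
        // with an OptimisticLockingException when it tries to complete.
        Thread.sleep(6 * 60 * 1000);
      }
    }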

You could try to extend the lock expiration time in the process engine configuration. For example, with a shared engine on Wildfly this can be done in the camunda subsystem configuration, in particular via the property lockTimeInMillis on the job-acquisition element. See also https://docs.camunda.org/manual/latest/reference/deployment-descriptors/tags/job-executor/#job-acquisition-configuration-properties
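For an embedded engine, the same setting can be adjusted programmatically instead. A minimal sketch, assuming a standalone configuration (the JDBC URL is a placeholder):

    import org.camunda.bpm.engine.ProcessEngine;
    import org.camunda.bpm.engine.impl.cfg.StandaloneProcessEngineConfiguration;
    import org.camunda.bpm.engine.impl.jobexecutor.DefaultJobExecutor;

    public class LongLockEngine {
      public static void main(String[] args) {
        // job executor with a 20-minute lock instead of the 5-minute default
        DefaultJobExecutor jobExecutor = new DefaultJobExecutor();
        jobExecutor.setLockTimeInMillis(20 * 60 * 1000);

        StandaloneProcessEngineConfiguration config = new StandaloneProcessEngineConfiguration();
        config.setJdbcUrl("jdbc:h2:mem:camunda"); // placeholder database
        config.setDatabaseSchemaUpdate("true");
        config.setJobExecutor(jobExecutor);
        config.setJobExecutorActivate(true);

        ProcessEngine engine = config.buildProcessEngine();
      }
    }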

Cheers,
Thorben

Hi Thorben,

Thanks for the feedback.
Accordingly, we changed our settings as follows:

We increased the transaction timeout in Wildfly’s standalone.xml:

    <subsystem xmlns="urn:jboss:domain:transactions:2.0">
        <core-environment>
            <process-id>
                <uuid/>
            </process-id>
        </core-environment>
        <recovery-environment socket-binding="txn-recovery-environment" status-socket-binding="txn-status-manager"/>
        <coordinator-environment default-timeout="1200"/>
    </subsystem>

The important line is <coordinator-environment default-timeout="1200"/>, which sets the timeout for transactions in EJBs to 20 minutes (1200 seconds). The default is 5 minutes (300 seconds).
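As an alternative to raising the global default, Wildfly also supports a per-bean timeout. A sketch, assuming the Wildfly-specific annotation from jboss-ejb3-ext-api is on the classpath:

    import java.util.concurrent.TimeUnit;
    import javax.ejb.Stateless;
    import org.jboss.ejb3.annotation.TransactionTimeout;

    @Stateless
    public class LongRunningWorkBean {
      // overrides the coordinator default for this method only
      @TransactionTimeout(value = 20, unit = TimeUnit.MINUTES)
      public void doLongWork() {
        // long-running work, up to 20 minutes inside one JTA transaction
      }
    }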

Additionally, we increased lockTimeInMillis to the same value:

    <job-executor>
        <thread-pool-name>job-executor-tp</thread-pool-name>
        <job-acquisitions>
            <job-acquisition name="default">
                <acquisition-strategy>SEQUENTIAL</acquisition-strategy>
                <properties>
                    <property name="lockTimeInMillis">1200000</property>
                    <property name="waitTimeInMillis">5000</property>
                    <property name="maxJobsPerAcquisition">3</property>
                </properties>
            </job-acquisition>
        </job-acquisitions>
    </job-executor>

The important part here is <property name="lockTimeInMillis">1200000</property> (1,200,000 ms = 20 minutes, matching the transaction timeout above).

To sum it up:
The assumption (per Thorben) is that when the transaction timeout of the Java EE server triggers, the process engine does not properly recognize that the job failed to execute, and as a consequence the job executor gets confused.