All processes are stuck

Ashwini_P · December 2, 2020, 12:47am

Hi Team, we are facing a problem in Camunda where all process flows are stuck and the logs show only the messages below. The job acquisition thread always sleeps for 100 or 99 millis. Things only start working again after a Camunda restart, but after some time it hangs again. Can you please suggest how to resolve this issue? We have deployed Camunda using the Docker image.

01-Dec-2020 11:23:58.920 FINE [Thread-6] org.camunda.commons.logging.BaseLogger.logDebug ENGINE-14011 Job acquisition thread sleeping for 100 millis
01-Dec-2020 11:23:59.021 FINE [Thread-6] org.camunda.commons.logging.BaseLogger.logDebug ENGINE-14012 Job acquisition thread woke up
01-Dec-2020 11:23:59.021 FINE [Thread-6] org.camunda.commons.logging.BaseLogger.logDebug ENGINE-14022 Acquired 0 jobs for process engine 'default': []
01-Dec-2020 11:23:59.021 FINE [Thread-6] org.camunda.commons.logging.BaseLogger.logDebug ENGINE-14023 Execute jobs for process engine 'default': [ae1149f3-33c3-11eb-a0bd-565d562f3e7d]
01-Dec-2020 11:23:59.021 FINE [Thread-6] org.camunda.commons.logging.BaseLogger.logDebug ENGINE-14023 Execute jobs for process engine 'default': [f377b55f-33c3-11eb-a0bd-565d562f3e7d]
01-Dec-2020 11:23:59.021 FINE [Thread-6] org.camunda.commons.logging.BaseLogger.logDebug ENGINE-14023 Execute jobs for process engine 'default': [8a5e7f79-33c3-11eb-a0bd-565d562f3e7d]
01-Dec-2020 11:23:59.021 FINE [Thread-6] org.camunda.commons.logging.BaseLogger.logDebug ENGINE-14011 Job acquisition thread sleeping for 100 millis
01-Dec-2020 11:23:59.121 FINE [Thread-6] org.camunda.commons.logging.BaseLogger.logDebug ENGINE-14012 Job acquisition thread woke up
01-Dec-2020 11:23:59.122 FINE [Thread-6] org.camunda.commons.logging.BaseLogger.logDebug ENGINE-14022 Acquired 0 jobs for process engine 'default': []
01-Dec-2020 11:23:59.122 FINE [Thread-6] org.camunda.commons.logging.BaseLogger.logDebug ENGINE-14023 Execute jobs for process engine 'default': [ae1149f3-33c3-11eb-a0bd-565d562f3e7]

thorben · December 2, 2020, 12:01pm

Hi,

Check out this blog post and apply its suggestions: https://camunda.com/blog/2019/10/job-executor-what-is-going-on-in-my-process-engine/

Cheers,
Thorben

Ashwini_P · December 3, 2020, 2:05pm

Hi Thorben,
Following that link, we enabled logging and made some changes to the job executor configuration, as shown below, but jobs are still getting stuck. Could you please confirm whether these changes look OK? After a restart it works for a while (about 1 hour) and then everything gets stuck at the first stage. We also observed that processes get stuck when there are locked jobs in the RU_JOB table: the select query is not issued, but according to the logs the engine keeps trying the locked jobs. If we start a new process instance, the insert query runs, but the select is never issued.

Ashwini_P · December 3, 2020, 2:11pm

      <job-executor>
        <job-acquisition name="default">
          <properties>
            <property name="maxJobsPerAcquisition">10</property>
            <property name="waitTimeInMillis">5000</property>
            <property name="lockTimeInMillis">120000</property>
            <property name="maxWait">60000</property>
            <property name="backoffTimeInMillis">1</property>
            <property name="maxBackoff">5</property>
            <property name="backoffDecreaseThreshold">10</property>
            <property name="waitIncreaseFactor">2</property>
          </properties>
        </job-acquisition>
        <properties>
          <!-- Note: the following properties only take effect in a Tomcat environment -->
          <property name="queueSize">10</property>
          <property name="corePoolSize">12</property>
          <property name="maxPoolSize">15</property>
          <property name="keepAliveTime">0</property>
        </properties>
      </job-executor>

      <process-engine name="default">
        <job-acquisition>default</job-acquisition>
        <configuration>org.camunda.bpm.engine.impl.cfg.StandaloneProcessEngineConfiguration</configuration>
        <datasource>java:jdbc/ProcessEngine</datasource>

        <properties>
          <property name="history">full</property>
          <property name="databaseSchemaUpdate">true</property>
          <property name="authorizationEnabled">true</property>
          <property name="jobExecutorDeploymentAware">false</property>
        </properties>

Ashwini_P · December 16, 2020, 12:17pm

Does anyone have a solution for this? We are still facing this issue.

Ingo_Richtsmeier · December 16, 2020, 12:45pm

Hi @Ashwini_P,

your settings look OK to me. What looks suspicious in your log snippet is that the same jobs are executed every time (for example ae1149f3-33c3-11eb-a0bd-565d562f3e7d).

What is behind these job IDs? I would start by digging into the database to find out what these jobs were created for.

Hope this helps, Ingo

Ashwini_P · December 16, 2020, 12:51pm

Hi @Ingo_Richtsmeier,

Yes, I dug into the table, and these jobs upload a file to Box. When we restart the Camunda pod, the same jobs run for some time but get stuck again after a while. I mean the Job Executor goes into an infinite loop and does not query the database (the select query is never executed). Is there a way to release the threads, or otherwise get the Job Executor to start picking up jobs again, without restarting Camunda?

Ingo_Richtsmeier · December 16, 2020, 1:07pm

Hi @Ashwini_P,

how long should the upload take? Are they timed out and the work (upload) never completed successfully? Or did they just take too long?

As you nailed it down to a single service task, inspect the implementation of this service task and add some logging statements to the code. If the upload fails, you can throw an exception to create an incident after all retries are done.

You set your lock time to two minutes. After two minutes the engine assumes that the job died and executes it again.

What is the content of retries_?

Hope this helps, Ingo

Ashwini_P · December 16, 2020, 1:22pm

Hi @Ingo_Richtsmeier,

   Thank you for your response.

how long should the upload take? Are they timed out and the work (upload) never completed successfully? Or did they just take too long?
- The work (the upload part) completed in the back end, but Cockpit still shows it as a running instance.

As you nailed it down to a single service task, inspect the implementation of this service task and add some logging statements to the code. If the upload fails, you can throw an exception to create an incident after all retries are done.
- Yes, we will add logging statements and check further. There are exception handlers that check the response code and set a message.

You set your lock time to two minutes. After two minutes the engine assumes that the job died and executes it again.
- How do we set the lock time? Is it set at task level or at process level? Please suggest.

What is the content of retries_?
- Retries is always 3 once it is stuck.

Ingo_Richtsmeier · December 16, 2020, 4:56pm

Hi @Ashwini_P,

you set the lock time in the job acquisition properties (`lockTimeInMillis` in the configuration you posted).

It is only available in the job executor settings and applies to all processes.
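
As a sketch of where that setting lives: the fragment below repeats the `job-acquisition` block from the configuration posted earlier in this thread, with only `lockTimeInMillis` changed. The value 600000 (10 minutes) is an assumption chosen for illustration, not a recommendation; it would need to be longer than the slowest upload is expected to take.

```xml
<job-executor>
  <job-acquisition name="default">
    <properties>
      <!-- Lock duration per acquired job, in milliseconds. It applies to
           every job acquired through this acquisition, not to a single
           task or process. 600000 is an illustrative value only. -->
      <property name="lockTimeInMillis">600000</property>
    </properties>
  </job-acquisition>
</job-executor>
```

If a job runs longer than the lock time, the lock expires and the job can be picked up again, which may be what the repeated job IDs in the log at the top of this thread are showing.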

Maybe this is the root cause?

Hope this helps, Ingo

Ashwini_P · December 17, 2020, 5:45am

Hi @Ingo_Richtsmeier,

How would the error handlers cause the issue? They just check the response code and take the respective action, such as setting a parameter. One example is below.

<camunda:script scriptFormat="javascript">
// The value of the last evaluated expression becomes the script result.
if (statusCode == 200) {
  if (S(response).hasProp("id")) {
    S(response).prop("id").stringValue();
  } else {
    "error_resp";
  }
} else {
  "error_resp";
}
</camunda:script>

Ingo_Richtsmeier · December 17, 2020, 11:29am

Hi @Ashwini_P,

how do you implement your service tasks at all?

Could you please provide a deeper insight with a process model and the code that implements the service?

Cheers, Ingo

Ashwini_P · December 17, 2020, 2:12pm

Hi @Ingo_Richtsmeier,

All services are written using Node.js LoopBack 3 (which internally calls other Java code). In the process flow, these services are accessed via their Node.js API endpoints through http_connector.

Ingo_Richtsmeier · December 18, 2020, 9:51am

Hi @Ashwini_P,

as you use connectors, your influence over the execution of the service task is limited.

In the long term I suggest moving to external service tasks, where you gain more control over the service execution. The external task client for JavaScript will help you here: https://github.com/camunda/camunda-external-task-client-js

To resolve the short-term problems, you can increase the log output of the connector to gain more insight. The logger name is org.camunda.bpm.connect: https://github.com/camunda/camunda-connect/blob/master/core/src/main/java/org/camunda/connect/impl/ConnectLogger.java.
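
To illustrate the modeling side of that suggestion, here is a minimal sketch of a service task declared as an external task. The process id, the task id, and the topic name `box-upload` are hypothetical; only the `camunda:type="external"` and `camunda:topic` attributes come from Camunda's BPMN extensions. A worker built with the client linked above would subscribe to that topic, perform the upload, and report completion or failure back to the engine.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<bpmn:definitions xmlns:bpmn="http://www.omg.org/spec/BPMN/20100524/MODEL"
                  xmlns:camunda="http://camunda.org/schema/1.0/bpmn"
                  id="Definitions_upload"
                  targetNamespace="http://bpmn.io/schema/bpmn">
  <!-- Hypothetical process: ids and names are placeholders. -->
  <bpmn:process id="fileUploadProcess" isExecutable="true">
    <bpmn:startEvent id="start" />
    <bpmn:sequenceFlow id="flow1" sourceRef="start" targetRef="uploadToBox" />
    <!-- The engine only publishes work on the topic; an external worker
         fetches it, so the upload does not run on a job executor thread. -->
    <bpmn:serviceTask id="uploadToBox" name="Upload file to Box"
                      camunda:type="external"
                      camunda:topic="box-upload" />
    <bpmn:sequenceFlow id="flow2" sourceRef="uploadToBox" targetRef="end" />
    <bpmn:endEvent id="end" />
  </bpmn:process>
</bpmn:definitions>
```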

Hope this helps, Ingo

guptaashish327 · November 7, 2023, 8:01am

@Ashwini_P Did you find a solution? We are facing the same issue.
On a deeper look, I am getting:

java.util.concurrent.RejectedExecutionException: Task org.camunda.bpm.engine.impl.jobexecutor.ExecuteJobsRunnable@38960577 rejected from java.util.concurrent.ThreadPoolExecutor@605380a6[Running, pool size = 10, active threads = 10, queued tasks = 3, completed tasks = 138]
    at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2065)
    at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:833)
    at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1365)
    at org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor.execute(ThreadPoolTaskExecutor.java:360)
    at org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor.executeJobs(SpringJobExecutor.java:59)
    at org.camunda.bpm.engine.impl.jobexecutor.SequentialJobAcquisitionRunnable.executeJobs(SequentialJobAcquisitionRunnable.java:139)
    at org.camunda.bpm.engine.impl.jobexecutor.SequentialJobAcquisitionRunnable.run(SequentialJobAcquisitionRunnable.java:81)
    at java.base/java.lang.Thread.run(Thread.java:833)

It won’t process timer event jobs.

nathan.loding · November 7, 2023, 6:02pm

Hi @guptaashish327 - welcome to the Camunda forums! This topic is 3 years old, and much has changed since then. Can you start a new topic with your issue, and provide as many details as you can? Thanks!