Find failed jobs and manually execute them via API

Andy · May 23, 2017, 2:37pm

Hello,

In our process we define transaction boundaries, so it can happen that a technical exception is thrown and causes a failed job. After the third failed attempt, Camunda puts the job into the failed job list and creates an incident for it. Now I would like to:

1. Get all failed jobs
2. Manually retry the failed jobs (as is possible in the Camunda Cockpit)

The solution for me currently is:

Get failed jobs:
List<Incident> incidents = runtimeService.createIncidentQuery().incidentType("failedJob").orderByIncidentTimestamp().desc().list();
How can I restart all the corresponding jobs? I know that this is possible via the management service, but managementService.executeJob(...) requires a job id, and the query above only gives me incident ids.

Of course I could get the execution id for each incident and create a job query via the managementService for that execution id. But is there a better way to handle this? I hope you can help me out. All I want is the two steps listed above (for example, for a process instance).

I would be grateful for your help and a code snippet.

Thanks a lot and best regards
Andy

thorben · May 23, 2017, 2:50pm

Hi Andy,

For failed jobs, the method Incident#getConfiguration returns the job id. As an alternative, you can find all failed jobs via managementService.createJobQuery().noRetriesLeft().list(). You won’t be able to sort by incident creation, though.

Another thing: Instead of running ManagementService#executeJob, you could call ManagementService#setJobRetries to resolve the incident. That way, the jobs won’t be executed in the thread that calls the API (which may therefore block for quite some time and/or error out), but will be picked up by the job executor asynchronously.

Cheers,
Thorben

Andy · May 23, 2017, 2:58pm

Thank you, Thorben, for your quick reply. I will post my solution here later. I will try setJobRetries and set the retries to 1. Then, if an exception occurs again, it will not crash the application the way executeJob would. I hope I understood you correctly. However, I am also looking for a way to get the stack trace of a failed job. Is that possible?

Cheers,
Andy

thorben · May 23, 2017, 3:02pm

Have a look at ManagementService#getJobExceptionStacktrace.

Andy · May 23, 2017, 3:12pm

Ah, great. One last question: is there a way to get the name of the task where the incident happened? This should only be possible via the incident query, right? I only see the getActivityId method on the Incident object.
Thanks again

thorben · May 23, 2017, 3:15pm

Or via Job#getJobDefinitionId and then JobDefinition#getActivityId. Either way is fine.

Andy · May 23, 2017, 3:37pm

Okay, so in order to get the full task name I have to retrieve the complete activity instance (via runtimeService.getActivityInstance), then look up the activityId and extract the name from it. Or is there a simpler way?

thorben · May 24, 2017, 9:02am

I would rather go via RepositoryService#getBpmnModelInstance, then navigate to the activity element and read the name attribute. That is independent of any runtime state.
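
The "navigate to the element and read its name attribute" idea can be illustrated without a running engine. The following is only a sketch: it swaps in the JDK's DOM parser for the Camunda model API so that it is self-contained, and the class name, method name and sample XML are hypothetical. Inside an application you would instead start from the model returned by RepositoryService#getBpmnModelInstance.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: looks up the "name" attribute of a BPMN element by its id.
// Uses plain JDK XML parsing as a stand-in for the engine's model API.
class ActivityNameLookup {

  /** Returns the name of the element with the given id, or null if there is none. */
  static String findNameById(String bpmnXml, String activityId) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      Document doc = factory.newDocumentBuilder()
          .parse(new ByteArrayInputStream(bpmnXml.getBytes(StandardCharsets.UTF_8)));
      // "*" matches every element in the document, whatever its tag name
      NodeList all = doc.getElementsByTagName("*");
      for (int i = 0; i < all.getLength(); i++) {
        Element element = (Element) all.item(i);
        if (activityId.equals(element.getAttribute("id"))) {
          return element.hasAttribute("name") ? element.getAttribute("name") : null;
        }
      }
      return null;
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse BPMN XML", e);
    }
  }

  public static void main(String[] args) {
    // Made-up process definition, just for the demonstration
    String xml = "<definitions xmlns=\"http://www.omg.org/spec/BPMN/20100524/MODEL\">"
        + "<process id=\"demo\">"
        + "<serviceTask id=\"rollbackTask\" name=\"Roll back order\"/>"
        + "</process></definitions>";
    System.out.println(findNameById(xml, "rollbackTask"));
  }
}
```

Because the lookup only depends on the deployed process definition, it keeps working after the process instance or the job has gone away, which is the point made above about runtime state.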

Andy · May 24, 2017, 10:38am

Finally, this is my solution:
I have created my own business object for a failed job that holds all the relevant information. It is called ‘FailedJob’ here.

public List<FailedJob> getAllFailedJobs() {
    ManagementService managementService = this.processEngine.getManagementService();
    RuntimeService runtimeService = this.processEngine.getRuntimeService();
    // Restrict the query to "failedJob" incidents: only for those does
    // getConfiguration() return the job id
    List<Incident> incidents = runtimeService.createIncidentQuery().incidentType("failedJob").list();
    List<FailedJob> failedJobs = new ArrayList<>();
    incidents.forEach(incident -> {
      FailedJob failedJob = new FailedJob();
      String jobId = incident.getConfiguration();
      failedJob.setJobId(jobId);
      // Convert the legacy java.util.Date into an OffsetDateTime in the system time zone
      Date incidentTimestamp = incident.getIncidentTimestamp();
      Instant timeInstant = incidentTimestamp.toInstant();
      failedJob.setIncidentTime(OffsetDateTime.ofInstant(timeInstant, ZoneId.systemDefault()));
      failedJob.setErrorMessage(incident.getIncidentMessage());
      failedJob.setExecutionId(incident.getExecutionId());
      failedJob.setStackTrace(managementService.getJobExceptionStacktrace(jobId));
      // getModelElementNameById is my own helper that resolves the activity name from the BPMN model
      failedJob.setRollbackTaskName(this.getModelElementNameById(incident.getActivityId(), incident.getProcessDefinitionId()));
      failedJobs.add(failedJob);
    });
    System.out.println("number of Failed jobs: " + failedJobs.size());
    return failedJobs;
}

Restart a list of jobs asynchronously:

  public void restartFailedJobs(List<FailedJob> jobsToRestart) {
    ManagementService managementService = this.processEngine.getManagementService();
    for(FailedJob job : jobsToRestart){
      System.out.println("Restart job with id " + job.getJobId());
      managementService.setJobRetries(job.getJobId(), 1);
    }
  }

I have also implemented a custom handler for incidents:

public class CustomIncidentHandler extends org.camunda.bpm.engine.impl.incident.DefaultIncidentHandler {

  // The constructor name must match the class name
  public CustomIncidentHandler(String type) {
    super(type);
  }

  @Override
  public void handleIncident(IncidentContext context, String message) {
    Incident incident = super.createIncident(context, message);
    // Placeholder for future functionality, e.g. sending a mail to an admin when an incident occurs
  }

  @Override
  public void resolveIncident(IncidentContext context) {
    super.resolveIncident(context);
    System.out.println(context.getConfiguration() + ": resolveIncident called");
  }

  @Override
  protected void removeIncident(IncidentContext context, boolean incidentResolved) {
    super.removeIncident(context,incidentResolved);
    System.out.println(context.getConfiguration() + ": remove Incident called!");
  }

}

And I have registered it in the configuration:

JtaProcessEngineConfiguration conf = ....;
conf.setCustomIncidentHandlers(Arrays.asList(new CustomIncidentHandler("failedJob")));

I tested it by throwing an exception inside the workflow and provoked 11 failed jobs, which is fine. So the method getAllFailedJobs works as expected. With the restartFailedJobs method, however, I noticed a behavior that I cannot explain. The job executor executes these jobs asynchronously after some time (1-2 minutes). That seems to be okay, because after I fixed my issues in the workflow, the number of failed jobs was 0 again, as expected. BUT from my sysouts I got this:

number of Failed jobs: 11
Restart job with id 4856
4856: remove Incident called!
4856: resolveIncident called
Restart job with id 4839
4839: remove Incident called!
4839: resolveIncident called
Restart job with id 4873
4873: remove Incident called!
4873: resolveIncident called
Restart job with id 4975
4975: remove Incident called!
4975: resolveIncident called
Restart job with id 4941
4941: remove Incident called!
4941: resolveIncident called
Restart job with id 4958
4958: remove Incident called!
4958: resolveIncident called
Restart job with id 5009
5009: remove Incident called!
5009: resolveIncident called
Restart job with id 4992
4992: remove Incident called!
4992: resolveIncident called
Restart job with id 4890
4890: remove Incident called!
4890: resolveIncident called
Restart job with id 4924
4924: remove Incident called!
4924: resolveIncident called
Restart job with id 4907

(after some time)

4907: remove Incident called!
4907: resolveIncident called
4839: remove Incident called!
4839: resolveIncident called
4873: remove Incident called!
4873: resolveIncident called
4856: remove Incident called!
4856: resolveIncident called
4941: remove Incident called!
4992: remove Incident called!
4941: resolveIncident called
4992: resolveIncident called
4975: remove Incident called!
4975: resolveIncident called
4958: remove Incident called!
4958: resolveIncident called
5009: remove Incident called!
5009: resolveIncident called
4890: remove Incident called!
4890: resolveIncident called
4907: remove Incident called!
4907: resolveIncident called
4924: remove Incident called!
4924: resolveIncident called
5510: remove Incident called!
5510: resolveIncident called
2891: remove Incident called!
2891: remove Incident called!
2891: resolveIncident called
5703: remove Incident called!
5703: resolveIncident called
(after some time)
number of Failed jobs: 0

As you can see, the job with id 4958, for example, is removed and resolved twice. How can this be? What has the job executor done? The retries were set to 1. I should also mention that the number of failed jobs only drops to 0 once the second round of calls has finished; before that, some incidents can still be found in the database. In a nutshell, everything works fine, but this is a behavior I cannot understand. Maybe someone can explain it to me.
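
To check an observation like this without reading the output by eye, the handler calls can be counted per job id. This is only a sketch that parses the sysout format shown above ("<jobId>: <message>"); the class and method names are hypothetical and it does not touch the engine at all.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: counts how often a given handler message was printed per job id.
class HandlerCallCounter {

  /** Counts lines of the form "<jobId>: <message>" for the given message, keyed by job id. */
  static Map<String, Integer> countCalls(List<String> lines, String message) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String line : lines) {
      int separator = line.indexOf(": ");
      // Lines without the separator, or with a different message, are ignored
      if (separator > 0 && line.substring(separator + 2).trim().equals(message)) {
        counts.merge(line.substring(0, separator).trim(), 1, Integer::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    // A few lines copied from the output above
    List<String> log = Arrays.asList(
        "Restart job with id 4958",
        "4958: remove Incident called!",
        "4958: resolveIncident called",
        "4958: remove Incident called!",
        "4958: resolveIncident called",
        "5510: remove Incident called!",
        "5510: resolveIncident called");
    System.out.println(countCalls(log, "resolveIncident called"));
  }
}
```

Any job id with a count greater than 1 went through the handler more than once.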

Thanks a lot for your help and all your detailed answers

Best regards,
Andy

thorben · May 24, 2017, 3:22pm

Could you please extract that behavior into a unit test? It is kind of hard to imagine the complete picture from a textual description alone. Thank you!

Andy · May 25, 2017, 6:21am

Yes, I can try. But I know that the job executor is deactivated in a test environment. As this behavior is related to the job executor, it would be important to activate it. Is it possible to activate the job executor in a unit test environment?

thorben · May 30, 2017, 1:23pm

Yes, just set this property to true: https://github.com/camunda/camunda-engine-unittest/blob/master/src/test/resources/camunda.cfg.xml#L18