Finally, this is my solution:
I have created my own business object for a failed job that holds all the relevant information. It is called 'FailedJob' here.
public List<FailedJob> getAllFailedJobs() {
    ManagementService managementService = this.processEngine.getManagementService();
    RuntimeService runtimeService = this.processEngine.getRuntimeService();
    // Every open incident is mapped to one FailedJob
    List<Incident> incidents = runtimeService.createIncidentQuery().list();
    List<FailedJob> failedJobs = new ArrayList<>();
    incidents.forEach(incident -> {
        FailedJob failedJob = new FailedJob();
        // The incident configuration is used as the job id here
        String jobId = incident.getConfiguration();
        failedJob.setJobId(jobId);
        Date incidentTimestamp = incident.getIncidentTimestamp();
        Instant timeInstant = incidentTimestamp.toInstant();
        failedJob.setIncidentTime(OffsetDateTime.ofInstant(timeInstant, ZoneId.systemDefault()));
        failedJob.setErrorMessage(incident.getIncidentMessage());
        failedJob.setExecutionId(incident.getExecutionId());
        failedJob.setStackTrace(managementService.getJobExceptionStacktrace(jobId));
        // getModelElementNameById is my own helper (not shown)
        failedJob.setRollbackTaskName(this.getModelElementNameById(incident.getActivityId(), incident.getProcessDefinitionId()));
        failedJobs.add(failedJob);
    });
    System.out.println("number of Failed jobs: " + failedJobs.size());
    return failedJobs;
}
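The FailedJob class itself is not shown here. A minimal sketch, assuming it is a plain bean whose fields simply mirror the setters used in getAllFailedJobs (the field names and types are guesses, not the actual class), might look like this:

```java
import java.time.OffsetDateTime;

// Hypothetical sketch of the FailedJob business object.
// Only the accessor names come from the code in this post; everything else is assumed.
class FailedJob {
    private String jobId;
    private String executionId;
    private OffsetDateTime incidentTime;
    private String errorMessage;
    private String stackTrace;
    private String rollbackTaskName;

    public String getJobId() { return jobId; }
    public void setJobId(String jobId) { this.jobId = jobId; }

    public String getExecutionId() { return executionId; }
    public void setExecutionId(String executionId) { this.executionId = executionId; }

    public OffsetDateTime getIncidentTime() { return incidentTime; }
    public void setIncidentTime(OffsetDateTime incidentTime) { this.incidentTime = incidentTime; }

    public String getErrorMessage() { return errorMessage; }
    public void setErrorMessage(String errorMessage) { this.errorMessage = errorMessage; }

    public String getStackTrace() { return stackTrace; }
    public void setStackTrace(String stackTrace) { this.stackTrace = stackTrace; }

    public String getRollbackTaskName() { return rollbackTaskName; }
    public void setRollbackTaskName(String rollbackTaskName) { this.rollbackTaskName = rollbackTaskName; }
}
```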
Restart a list of jobs asynchronously:
public void restartFailedJobs(List<FailedJob> jobsToRestart) {
    ManagementService managementService = this.processEngine.getManagementService();
    for (FailedJob job : jobsToRestart) {
        System.out.println("Restart job with id " + job.getJobId());
        managementService.setJobRetries(job.getJobId(), 1);
    }
}
I have also implemented a handler for incidents:
public class CustomIncidentHandler extends org.camunda.bpm.engine.impl.incident.DefaultIncidentHandler {
    // The constructor name must match the class name
    public CustomIncidentHandler(String type) {
        super(type);
    }
    @Override
    public void handleIncident(IncidentContext context, String message) {
        Incident incident = super.createIncident(context, message);
        // just a preparation in case we want to add functionality, for instance: write a mail to an admin if an incident occurs
    }
    @Override
    public void resolveIncident(IncidentContext context) {
        super.resolveIncident(context);
        System.out.println(context.getConfiguration() + ": resolveIncident called");
    }
    @Override
    protected void removeIncident(IncidentContext context, boolean incidentResolved) {
        super.removeIncident(context, incidentResolved);
        System.out.println(context.getConfiguration() + ": remove Incident called!");
    }
}
And I have set it up in the configuration:
JtaProcessEngineConfiguration conf = ....;
conf.setCustomIncidentHandlers(Arrays.asList(new CustomIncidentHandler("failedJob")));
I tested it by throwing an exception inside the workflow, which provoked 11 failed jobs, as intended. So the getAllFailedJobs method works as expected. With the restartFailedJobs method, however, I noticed a behaviour that I cannot explain. The job executor executes these jobs asynchronously after some time (1-2 minutes). That seems to be okay: after I fixed the issues in the workflow, the number of failed jobs was 0 again, as expected. BUT from my sysouts I got this:
number of Failed jobs: 11
Restart job with id 4856
4856: remove Incident called!
4856: resolveIncident called
Restart job with id 4839
4839: remove Incident called!
4839: resolveIncident called
Restart job with id 4873
4873: remove Incident called!
4873: resolveIncident called
Restart job with id 4975
4975: remove Incident called!
4975: resolveIncident called
Restart job with id 4941
4941: remove Incident called!
4941: resolveIncident called
Restart job with id 4958
4958: remove Incident called!
4958: resolveIncident called
Restart job with id 5009
5009: remove Incident called!
5009: resolveIncident called
Restart job with id 4992
4992: remove Incident called!
4992: resolveIncident called
Restart job with id 4890
4890: remove Incident called!
4890: resolveIncident called
Restart job with id 4924
4924: remove Incident called!
4924: resolveIncident called
Restart job with id 4907
(after some time)
4907: remove Incident called!
4907: resolveIncident called
4839: remove Incident called!
4839: resolveIncident called
4873: remove Incident called!
4873: resolveIncident called
4856: remove Incident called!
4856: resolveIncident called
4941: remove Incident called!
4992: remove Incident called!
4941: resolveIncident called
4992: resolveIncident called
4975: remove Incident called!
4975: resolveIncident called
4958: remove Incident called!
4958: resolveIncident called
5009: remove Incident called!
5009: resolveIncident called
4890: remove Incident called!
4890: resolveIncident called
4907: remove Incident called!
4907: resolveIncident called
4924: remove Incident called!
4924: resolveIncident called
5510: remove Incident called!
5510: resolveIncident called
2891: remove Incident called!
2891: remove Incident called!
2891: resolveIncident called
5703: remove Incident called!
5703: resolveIncident called
(after some time)
number of Failed jobs: 0
As you can see, the job with id 4958, for example, is removed and resolved twice. How can this be? What has the job executor done? The retries were set to 1. I should add that the number of failed jobs only drops to 0 once the second round of calls has finished; before that, some incidents can still be found in the database. In a nutshell, everything works fine, but this is a behaviour I cannot understand. Maybe someone can explain it to me.
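To make the duplicate calls easier to spot, the sysout lines can be tallied per job id. The following is only a small diagnostic sketch: IncidentLogTally and countResolves are hypothetical names, and it assumes the log lines have exactly the "<jobId>: resolveIncident called" form printed by the handler above. It counts the calls but does not explain why they happen.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: counts how often resolveIncident was logged per job id.
class IncidentLogTally {
    private static final String SUFFIX = ": resolveIncident called";

    static Map<String, Integer> countResolves(List<String> lines) {
        // LinkedHashMap keeps the ids in the order they first appear in the log
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.endsWith(SUFFIX)) {
                String jobId = trimmed.substring(0, trimmed.length() - SUFFIX.length());
                counts.merge(jobId, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Excerpt of the output shown in this post
        List<String> excerpt = List.of(
            "Restart job with id 4958",
            "4958: remove Incident called!",
            "4958: resolveIncident called",
            "4958: remove Incident called!",
            "4958: resolveIncident called",
            "5510: remove Incident called!",
            "5510: resolveIncident called");
        countResolves(excerpt).forEach((id, n) ->
            System.out.println(id + " resolved " + n + " time(s)"));
    }
}
```

Ids that were restarted through restartFailedJobs show up with a count of 2 in the full output, while ids such as 5510 appear only once.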
Thanks a lot for your help and all your detailed answers
Best regards,
Andy