Job Executor hangs or stops acquiring jobs? (solved: HTTP-Connector stuck in endless job due to long http request/no response)

@thorben Additional question:

When you delete an active process instance that has a job currently executing, does the job get cancelled mid-execution?

Edit: Did some tests. It looks like if a job is being executed and the process instance is deleted, the job will not be cancelled. I tested this by creating a load of jobs until the HTTP-connector endless response issue occurred (I think). I then deleted some potentially problematic process instances, but the job executor did not pick up the changes. Once I restarted the container, the job executor went back to normal.

Edit: Have narrowed the problem down to be related to http-connector requests. Further testing tomorrow related to outside network issues (likely).

@thorben can the httpclient configuration settings be accessed within the connector variable of http-connector? For example using a listener or a script in one of the inputs to modify the timeouts? I looked through the docs, but I could not clearly see where the access could be.

Hey Stephen,

It runs until completion and then fails with an OptimisticLockingException since the job and execution tree was updated in the meantime.

No, same optimistic locking behavior as above.
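
The optimistic locking behavior described above can be illustrated with a minimal in-memory sketch. It is not the engine's implementation: `JobRow`, `StaleRevisionException`, and `executeJob` are hypothetical names, and the only assumption is that a job remembers the revision it read and the commit is rejected if the row changed in the meantime.

```java
class OptimisticLockSketch {
    // Hypothetical stand-in for the exception raised when a commit hits stale data.
    static class StaleRevisionException extends RuntimeException {
        StaleRevisionException(String message) {
            super(message);
        }
    }

    // Hypothetical in-memory stand-in for a job row carrying a revision counter.
    static class JobRow {
        private int revision = 1;
        private boolean deleted = false;

        synchronized int readRevision() {
            return revision;
        }

        // A concurrent delete changes the row, which invalidates revisions read earlier.
        synchronized void delete() {
            deleted = true;
            revision++;
        }

        // The commit succeeds only if nobody touched the row since it was read.
        synchronized void completeIfUnchanged(int expectedRevision) {
            if (deleted || revision != expectedRevision) {
                throw new StaleRevisionException("row changed since revision " + expectedRevision);
            }
            revision++;
        }
    }

    // Runs the work to completion, then tries to commit. Returns false if the commit was rejected.
    static boolean executeJob(JobRow row, Runnable work) {
        int seenRevision = row.readRevision();
        work.run(); // a concurrent delete does not stop this call
        try {
            row.completeIfUnchanged(seenRevision);
            return true;
        } catch (StaleRevisionException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        JobRow row = new JobRow();
        // The "work" deletes the row itself to simulate a delete arriving mid-execution.
        boolean committed = executeJob(row, row::delete);
        System.out.println("committed: " + committed);
    }
}
```

The point of the sketch is the ordering: the work always finishes first, and the conflict is only detected at commit time.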

I don’t think so, but haven’t really looked into this. The config I linked to above is used when the client is built. This only happens once, so you can’t configure the timeout on per-request basis this way. But maybe there’s another way with Apache Http client to do this.
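
One possible way to get a per-request timeout is to make the call from a custom JavaDelegate with a client that supports it, rather than through http-connector. The sketch below uses the JDK's own `java.net.http.HttpClient` (Java 11+), which is an assumption and not the Apache client the connector builds. The class name, method names, and timeout values are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class PerRequestTimeout {
    // Built once and shared; only the connect timeout is fixed at this level.
    static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();

    // Builds a GET request whose response timeout is chosen by the caller for this one call.
    static HttpRequest buildRequest(String url, Duration timeout) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(timeout)
                .GET()
                .build();
    }

    // Sends the request. If no response arrives within the timeout, send() throws
    // java.net.http.HttpTimeoutException (an IOException) instead of waiting forever.
    static String fetch(String url, Duration timeout) throws IOException, InterruptedException {
        HttpResponse<String> response =
                CLIENT.send(buildRequest(url, timeout), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Inside a delegate, a call such as `PerRequestTimeout.fetch(url, Duration.ofSeconds(60))` would then fail with an exception after 60 seconds, so the job ends instead of occupying a job executor thread indefinitely.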

@thorben okay great. We have narrowed it down then to a specific use case: network connectivity (only re-creatable under certain load conditions caused by Camunda).

Any ideas on where to look for potentially modifying the per-request timeouts?

I’m still puzzled; shouldn’t there be a lock visible in the job table?

R

A few items that came out of this as changes / needs for the engine:

  1. REST API calls that list jobs (GET /job or GET /job/:id) should return a “Locked” property so you can see which jobs are currently locked by the engine, i.e. currently being processed. (https://app.camunda.com/jira/browse/CAM-8192)

  2. It would be really nice to have an input parameter for HTTP-Connector to set a timeout. By default it should be set to something like 60 seconds, and you could use the input parameter to override it with a custom value. (https://app.camunda.com/jira/browse/CAM-8191)

  3. @camunda are there any specific reasons for allowing jobs to “run forever” and not having a forced default time limit on a job’s execution?

  4. We used a third party API-gateway to force the timeout rather than make changes to HTTP-Connector. But making changes to HTTP Connector appears to be the better longer term option.

  5. @camunda is there documentation about a job being deleted but remaining in the executor until restart (i.e. deleting an in-flight job does not cancel it)?

  6. Adding a response time return variable to http-connector would be very helpful. It would report how long the request took.

Feel free to raise a feature request. Instead of a boolean property, this should expose the current lock expiration time.

Feel free to raise a feature request.

It’s not possible to safely interrupt a thread in Java that is stuck in an infinite loop, waiting forever, etc.

To rephrase it: you would like to immediately cancel a job when it gets deleted (correct me if I’m wrong). Then the answer is the same as for question 3. Not doable, especially not in cluster setups.

Not sure if this is generally useful. What would you do with that variable in your example?

The response time on http-connector was more of a side feature that is useful for generating related incidents / indicators of issues. (If response times start to get too long, this can easily be logged for later intervention/review.)

Would this scenario not warrant a wrapper around a job so that infinite loops, waiting forever, etc. cannot occur? Even if a loop does occur in a job, the higher-level abstraction would time out.
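
Until the connector exposes such a variable, the same indicator could be produced in custom code by timing the call yourself. This is only a sketch: `TimedCall`, `Timed`, and the threshold are hypothetical names and values, and the supplier stands in for whatever makes the actual request.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class TimedCall {
    // Result of a call together with how long it took.
    static final class Timed<T> {
        final T value;
        final long elapsedMillis;

        Timed(T value, long elapsedMillis) {
            this.value = value;
            this.elapsedMillis = elapsedMillis;
        }

        // True if the call took longer than the given threshold.
        boolean slowerThan(long thresholdMillis) {
            return elapsedMillis > thresholdMillis;
        }
    }

    // Runs the call and measures wall-clock duration with the monotonic nanoTime clock.
    static <T> Timed<T> measure(Supplier<T> call) {
        long start = System.nanoTime();
        T value = call.get();
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        return new Timed<>(value, elapsedMillis);
    }

    public static void main(String[] args) {
        Timed<String> result = measure(() -> "response body");
        if (result.slowerThan(1000)) {
            System.out.println("slow request: " + result.elapsedMillis + " ms");
        }
    }
}
```

The elapsed value could then be stored as a process variable or logged, which is the kind of early-warning signal described above.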

Feature requests raised

(https://app.camunda.com/jira/browse/CAM-8192)
(https://app.camunda.com/jira/browse/CAM-8191)

If you have a concrete idea, please provide some code. I don’t know how to build such a thing in a good way.

Note that the job’s lock expiration time is already such a timeout. This ensures that the job will be triggered again, but it does not cancel the ongoing execution. So if you have an infinite loop in a JavaDelegate or similar, all the job executor’s threads will be busy after some time.
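
To make the limitation concrete, here is a sketch of the kind of timeout wrapper asked about, built on the standard `ExecutorService` and `Future` API. It is not engine code, and `JobTimeoutWrapper` and its methods are hypothetical names. The wrapper can stop waiting, and `cancel(true)` can request an interrupt, but work that never checks the interrupt flag keeps its thread busy anyway.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

class JobTimeoutWrapper {
    // Returns true if the work finished in time; otherwise requests an interrupt and returns false.
    static boolean runWithTimeout(ExecutorService pool, Runnable work, long timeoutMillis) {
        Future<?> future = pool.submit(work);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // only interrupts the worker; it cannot force it to stop
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            throw new IllegalStateException("work failed", e.getCause());
        }
    }

    // Runs a looping task past its timeout and reports whether the worker thread actually stopped.
    static boolean workerStopsAfterTimeout(boolean cooperative) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        AtomicBoolean release = new AtomicBoolean(false); // escape hatch so the demo itself can end
        Runnable work = () -> {
            while (!release.get()) {
                if (cooperative && Thread.currentThread().isInterrupted()) {
                    return; // cooperative code notices the interrupt and exits
                }
                Thread.onSpinWait();
            }
        };
        runWithTimeout(pool, work, 100);
        pool.shutdown();
        try {
            return pool.awaitTermination(500, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            release.set(true); // let the non-cooperative worker finish so the JVM can exit
        }
    }

    public static void main(String[] args) {
        System.out.println("cooperative worker stopped: " + workerStopsAfterTimeout(true));
        System.out.println("non-cooperative worker stopped: " + workerStopsAfterTimeout(false));
    }
}
```

In the non-cooperative case the caller gets its timeout, but the worker thread stays occupied, which is the same way a pool of job executor threads can fill up over time.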

Thanks @thorben. We may have some time in the fall to have someone look at this.

@thorben 4-year-old bug! @StephenOTT put in the time to debug and even submitted feature requests for solutions, which were ignored.

Might I ask why it has been ignored for 4 years when many people have reported this as a problem?

As Stephen says in this thread, EVERY TASK stops working and the only solution I have found is a complete wipe and reload.

I personally have had to completely wipe and reload Camunda about 4 times in the past year while we have tried to evaluate Camunda as a BPM solution. I’m sure you would agree that it would be a hard decision for anyone to move forward with a solution that has that bad of a track record.

A platform that is supposed to be rock-solid at enforcing business processes and rules, yet requires reloading every few months, falls quite short of being reliable.

Below are just the first few posts from a forum search for “Job Executor Stuck”.

One
Two
Three
Four
Five
Six

@GChester1 let us calm down a bit here… the issue appears related to the http connector, and clearly the connector and connectors in general have been given a very low priority for maintenance or new development by Camunda (the company), but nothing has stopped others from sourcing and fixing the problem: no one in the community has fixed it either.

There are multiple solutions to the problem at the community level, and there is always paid support you can get from Camunda or other providers.

Lots of options to solve your problem.
No need to be singling out developers.

@thorben was clearly the person who was working on this before and I’m simply asking why it has been allowed to go this long unsolved.

I thought I handled it quite well, after wasting an entire year of my life evaluating this product.

When you have wasted the amount of time and resources for a bug that has been ignored for 4 years, yes there is a bit to get upset about.

If he is not the person who works on the Camunda team, then my apologies.

On the other hand, if he is on the Camunda team, he seems like the best place to start on the mystery of why this issue was dropped.

Hello @GChester1,

As of the time of writing, the platform has 929 open feature requests and 867 open bug reports (2504 and 2646 have been implemented and fixed over the years). We have to carefully prioritize our work and choose those tickets that matter most to us. https://jira.camunda.com/browse/CAM-8191 hasn’t been part of that so far.

With this experience and that you were not able to resolve this problem in another way (using external tasks, using a custom JavaDelegate, using another HTTP client library like JSoup as Stephen suggests, contributing a bug fix because Camunda is open-source), I honestly think that Camunda is not the right technology for you.

The job executor is not a closed system and can encounter various problems, sometimes related to Camunda components, sometimes to the code/systems our users integrate. Users need to understand its workings a bit and also how to diagnose problems. For that purpose, I have published an extensive blogpost in the past: The Job Executor: What Is Going on in My Process Engine? - Camunda. Maybe it will help you in the future.

I hope none of this comes across as aggressive, it’s just my honest plain view on your comments.

Cheers,
Thorben

I’m not sure you did handle this particularly well… There’s no need to take out your frustration on someone else, especially someone who really can’t be blamed for your own decision making.

An issue for sure, is that you apparently spent a year evaluating Camunda and somehow never came across the fact that we don’t suggest using connectors in a variety of scenarios. That could be your fault for not being able to google for our docs or ours for not being clearer about getting the message out there. Connectors are very basic and there are much better options out there - depending on what you’re looking to do. So consider a different solution.

I could not agree more that bugs need to be prioritized.

I would have thought an issue that is causing quite a few people problems, which is verifiable by a simple search of the forums, would justify at least a bit of priority. A simple search produced 227 results.

This bug shuts down the entire executor, in many cases never to start again, and people complain about it over and over in the forums.

Possibly could you guys have stayed teamed up with the original Activiti group if you are so far behind?

Awww, yes, I’m quite familiar with that document. Here is even a thread I posted almost 4 months ago trying to work through how to enable the debugger.

I might point out this thread is still unanswered to this day.

I’m so familiar with that Support Document after reading it at least a dozen times that I could probably give a Public talk on it.

It has never really helped us solve the issue, so if you guys are under the impression it is a work-around solution for everyone, I’m not sure it is.

Actually @Niall, your assumption could not be farther from the truth. As I just stated above, I’ve spent months trying to debug this problem with Camunda and hours upon hours searching support documents, forum posts, StackOverflow, GitHub, and about everything else I could think of. As mentioned, many of my posts about the subject have gone unanswered.

I have never heard of another option other than the JSoup. I originally did not want to use JSoup since it was an external resource; not internal to Camunda.

Adding more complexity to use External Resources only complicates platforms even more.

A simple search on any of the reviews sites will tell you users think Camunda is Overly Complex already so I wanted to keep things simple and internal.

I assume just about every single person who uses Camunda needs to connect to external APIs. I wanted to use an internal feature of Camunda, thinking it would be more stable. So I must have been wrong there.

@Niall @StephenOTT We are simply going to have to agree to disagree. I have even reread my post and I think it made perfect sense to go back to the source and deal with the person who originally helped diagnose the problem. I did not blame anyone; I simply asked why this LARGE BUG has been ignored, when all my other posts were practically ignored, kinda like the original bug.

On the other hand, you guys seem quick to blame me and not take responsibility for ignoring a HUGE BUG that has cost so many of your loyal followers hours, days, and years of their lives.

@thorben did not seem offended in his response and no offense was meant, though I was trying to be direct.

Obviously, it worked and got your attention.

Please address the gaping wound, stop the bleeding of your loyal users and fix the Job Executor Bug!

It seems so simple, @StephenOTT is quite talented and has done the hard work.

Hi @GChester1,

Thanks for your extensive feedback. I think most points have been made in this discussion. I disagree with some statements of your last post, however I won’t argue with you as I have the impression that this will lead us nowhere.

Our decision to not prioritize https://jira.camunda.com/browse/CAM-8191 remains unchanged for now, but we will keep your points in mind for future planning. If you’d like to contribute a fix, we are happy to help.

It’s correct that I am not offended by your posts. That said, your use of language comes across as aggressive and does not help your concern.

I wish you all the best.

Cheers,
Thorben

It got you guys to finally address the issue instead of ignoring like my posts over the last half-year; so I got what I wanted.

But of course, there will never be any admission of non-response or fault from the other side.

However, I’ll point out there was still no action taken, just a bunch of chest-puffing and posturing.

I’m not a developer anymore and just run the company, and if that were not enough, I just do not feel you guys have provided enough details to dive into the Job Executor to fix this bug.

On the other hand dealing with Camunda the past year has really let us see the future of Camunda, but we already know the history quite well as I have done my homework.

I’ve seen project after project for over 30 years do just as the Activiti project did. Piwik just to name one, Redmine and many others. It is usually over greed and/or arrogance.

Hi @GChester1
Glad you got that little rant off your chest :slightly_smiling_face:
But I think I’m going to lock this topic now as I think it’s better I prevent you from embarrassing yourself further. :wink:

For the moment you’re welcome to ask any other questions you might have by starting a new topic, but please be aware that we have a code of conduct that I would hope you’d take some time to read, so you understand that this is not a place where we’re going to accept your outwardly aggressive tone.

You can send me a private message if you have any further questions about my decision to end the conversation.