Sure. What I have is a Spring Boot 2.2.7 app with an embedded Camunda 7.13 and our own library versions (according to the factory standards). Right now it’s just running locally.
I forgot to add more detail about the error I’m getting.
- What went wrong:
Execution failed for task ':compileJava'.
Could not resolve all files for configuration ':compileClasspath'.
Could not resolve commons-logging:commons-logging:1.2.
Required by:
project : > org.camunda.connect:camunda-connect-http-client:1.4.0 > org.apache.httpcomponents:httpclient:4.5.12
project : > com.github.StephenOTT:camunda-prometheus-process-engine-plugin:v1.8.0 > org.apache.httpcomponents:fluent-hc:4.5.12
Module 'commons-logging:commons-logging' has been rejected:
Cannot select module with conflict on capability 'logging:jcl-api-capability:0' also provided by [org.springframework:spring-jcl:5.2.6.RELEASE(compile)]
Could not resolve org.springframework:spring-jcl:5.2.6.RELEASE.
Required by:
project : > com.github.StephenOTT:camunda-prometheus-process-engine-plugin:v1.8.0 > org.springframework:spring-core:5.2.6.RELEASE
Module 'org.springframework:spring-jcl' has been rejected:
Cannot select module with conflict on capability 'logging:jcl-api-capability:0' also provided by [commons-logging:commons-logging:1.2(compile)]
Do you need more information?
Regards,
Diego
I guess so too.
I’ll investigate the error a little more. If I find a solution, can I open a PR against your repo?
Regards,
Diego
Sure. A quick test would be to update those deps and “hope” there are no breaking changes.
It wasn’t necessary.
I just had to add these lines to build.gradle:
configurations.all {
    // exclude the conflicting modules; they are pulled in transitively by other dependencies
    exclude module: 'httpclient'
    exclude module: 'commons-logging'
}
Those modules are pulled in by other dependencies; since spring-jcl already provides the JCL API capability, excluding commons-logging resolves the conflict.
Regards,
Diego
Hi @StephenOTT, after a couple of weeks I was able to adapt your implementation to Spring Boot 2.2.6 and Camunda 7.13.
If you’re interested, I could make a PR to your repo. But first, we should talk about the best way to organize the code according to your standards.
Cheers,
Diego
If you want to post it in the repo as a WIP (work in progress) PR, I would be interested to take a look.
Hello @StephenOTT
Thanks for building this plug-in. We have integrated it with our Camunda setup and it has helped us monitor the workflow metrics in a much better way. Are there any more features planned, such as alert rules and alert configurations?
@Sandeep_Yalamarthi can you give me some examples of features you are looking for?
Most alerts I had imagined would be Prometheus-specific configurations per implementation.
@StephenOTT We run an internal PaaS application with the process engine embedded in the application, on k8s in a single pod. We have observed that the pod/application crashes frequently due to a high number of asynchronous jobs (1,500,000) created in the background, which also means a high number of incidents caused by a faulty BPMN/workflow. To avoid this kind of self-inflicted DDoS, it would be better if we could have an alert when the number of background jobs crosses a certain threshold. The same applies to any metric that affects the application/engine health.
You should be able to set a threshold alert in Grafana; you should not need anything special. What is missing?
A new plugin has been developed that is a replacement:
The new plugin leverages Micrometer and Spring Boot Actuator.
You can use any of the supported Micrometer monitoring systems.
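For reference, the usual Micrometer/Actuator wiring in a Gradle-based Spring Boot project looks roughly like the sketch below (these are the standard Spring Boot/Micrometer coordinates, not taken from the new plugin’s docs — check its README for the exact setup):

// build.gradle: Actuator plus the Prometheus registry for Micrometer
implementation 'org.springframework.boot:spring-boot-starter-actuator'
implementation 'io.micrometer:micrometer-registry-prometheus'

// application.properties: expose the scrape endpoint
// management.endpoints.web.exposure.include=health,info,prometheus

Prometheus then scrapes the /actuator/prometheus endpoint that Actuator exposes.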
We are running Camunda 7.12 in production. A lot of our BPMNs are modelled with connector tasks making HTTP calls to other services. To monitor all these business metrics we tried to implement the earlier plugin,
StephenOTT/camunda-prometheus-process-engine-plugin, with the default scrape frequency of 5s. But it was causing high CPU utilisation on the DB server and we had to disable the plugin.
Will integrating the latest plugin solve our issue? Have there been any reports of performance issues when it is integrated into a heavily loaded application with a huge amount of data in the history database?
What are the ideal configurations to run this plugin?
If you’re running the queries every 5 seconds, then you are querying the database every 5 seconds. A high CPU load is expected if you are querying for large amounts of data.
Which specific metrics are you running?
Almost all of the metrics provided by the initial plugin. This is the list:
- camunda_metric_activity_instance_start
- camunda_metric_activity_instance_end
- camunda_metric_executed_decision_elements
- camunda_metric_job_successful
- camunda_metric_job_failed
- camunda_metric_job_acquisition_attempt
- camunda_metric_job_acquired_success
- camunda_metric_job_acquired_failure
- camunda_metric_job_execution_rejected
- camunda_metric_job_locked_exclusive
- camunda_process_definition_stats_instance_count
- camunda_message_event_subscription_count
- camunda_signal_event_subscription_count
- camunda_compensation_event_subscription_count
- camunda_conditional_event_subscription_count
- camunda_open_incidents_count
- camunda_resolved_incidents_count
- camunda_deleted_incidents_count
- camunda_active_process_instance_count
- camunda_active_user_tasks_count
- camunda_active_unassigned_user_tasks_count
- camunda_camunda_suspended_user_tasks_count
- camunda_active_timer_job_count
- camunda_suspended_timer_job_count
Although for a few of the metrics I have done some customisations to add an extra tenantId label. An example Groovy snippet for a customised counter metric is below.
import java.util.stream.Collectors

import org.camunda.bpm.engine.ProcessEngines
import org.camunda.bpm.engine.repository.ResourceDefinition

// Collect the tenant ids of all deployed process definitions once.
static {
    tenantsList = ProcessEngines.getDefaultProcessEngine()
            .getRepositoryService()
            .createProcessDefinitionQuery()
            .list()
            .stream()
            .map(ResourceDefinition::getTenantId)
            .collect(Collectors.toList());
}

// For each tenant, count its SIGNAL event subscriptions and report the value
// with the tenant id and engine name as labels.
// (processEngine, counter and engineName come from the surrounding context.)
tenantsList.forEach { tenantId ->
    long count = processEngine.getRuntimeService()
            .createEventSubscriptionQuery()
            .eventType("SIGNAL")
            .tenantIdIn(tenantId)
            .count();
    counter.setValue(count, Arrays.asList(tenantId, engineName));
}
I would recommend you disable 1, 2, and 11.
Always consider what level of detail you actually need visibility on. Each of those 24 items is a query being executed, some with 1+N scenarios such as #11, where it gets a list of definitions and then does further lookups for each one. This can be a lot of data to process, especially given you are executing every 5 seconds.
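To make the 1+N shape concrete, here is a hypothetical sketch (in the same Groovy style as the custom metric above, not the plugin’s actual source) of a per-definition instance count:

// 1 query: list every deployed process definition
def definitions = processEngine.getRepositoryService()
        .createProcessDefinitionQuery()
        .list()

// + N queries: one count per definition, repeated every scrape interval
definitions.each { definition ->
    long activeInstances = processEngine.getRuntimeService()
            .createProcessInstanceQuery()
            .processDefinitionId(definition.getId())
            .active()
            .count()
    // report activeInstances with the definition key as a label
}

With many definitions and a 5-second scrape interval, that quickly becomes a steady load on the database.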
Thanks for the suggestions @StephenOTT. For now we have increased the frequency to 15 minutes and disabled 1, 2, 11 and a few of the custom metrics, and things seem to have stabilised. I am also wondering if there is any other way of getting the full telemetry of the engine without the DB-scraping approach. Can Camunda push metrics to Prometheus collectors while creating/invoking the resources itself?
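The thread doesn’t answer this, but as a rough illustration of what event-driven (push-style) collection could look like — a sketch only, assuming Micrometer is on the classpath; the class name and wiring below are made up, and the listener would still need to be attached via the BPMN model or an engine plugin:

import io.micrometer.core.instrument.MeterRegistry
import org.camunda.bpm.engine.delegate.DelegateExecution
import org.camunda.bpm.engine.delegate.ExecutionListener

// Hypothetical listener: increments a counter each time it fires, so the metric
// is recorded at event time instead of by polling the database.
class ActivityMetricListener implements ExecutionListener {

    private final MeterRegistry registry

    ActivityMetricListener(MeterRegistry registry) {
        this.registry = registry
    }

    @Override
    void notify(DelegateExecution execution) {
        registry.counter('bpmn.activity.events',
                'processDefinitionId', execution.getProcessDefinitionId(),
                'activityId', execution.getCurrentActivityId())
                .increment()
    }
}

This avoids querying the history tables, but only covers events that pass through the listener; engine-internal metrics (job acquisition, incidents, etc.) would still have to come from the engine’s own metrics or queries.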