How to calculate cycle time from grafana dashboard

jgeek1 · March 6, 2023, 6:25am

I am running multiple benchmarking tests to determine the cycle time and throughput using the benchmarking tool while varying the taskCompletionDelay. For one of the tests, with 7 brokers and a taskCompletionDelay of 20 secs, I get a throughput of 70 PI/s. I have 12 tasks being executed in the process run, so it should take at least 240 secs (4 mins).

I would like to record the process execution time, but I am not able to understand the “Process Instance Execution Time (avg)” panel on the grafana dashboard. The y-axis states 1.68 Bil secs as avg (screenshot attached). That doesn’t sound like a correct value. Also, to my surprise, the “Process Instance Execution Time” graph is the same irrespective of the taskCompletionDelay. I see the same graph for taskCompletionDelay values of 5, 10 or 20 secs.

Can someone help me understand how to calculate cycle time from the prebuilt dashboard?
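The expected lower bound used above can be written down as a small calculation. This is only a sketch: it assumes the 12 tasks run strictly one after another and that each one takes exactly the configured taskCompletionDelay, ignoring engine and network overhead. The function name is made up for illustration.

```python
def expected_min_cycle_time(task_count: int, task_completion_delay_s: float) -> float:
    """Lower bound for the cycle time of one process instance, in seconds.

    Assumes the tasks are executed sequentially and each task waits
    exactly task_completion_delay_s before completing.
    """
    return task_count * task_completion_delay_s


# The three delays mentioned in this thread, for a process with 12 tasks.
for delay in (5, 10, 20):
    bound = expected_min_cycle_time(12, delay)
    print(f"taskCompletionDelay={delay}s -> cycle time of at least {bound}s")
```

Any value the dashboard reports below this bound is a hint that the panel, not the process, is off.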

Thanks.

korthout · March 9, 2023, 1:47pm

1.68 Bil secs is indeed an oddly high cycle time.

I’m not sure what is causing this issue. Perhaps we can pinpoint the issue a bit. Which version of the grafana dashboard are you using? And did you make any changes to it, or is it identical to the one in the zeebe repo?

If you made changes, please share the Edit details of the panel.

Can you share screenshots of the other panels in the Latency tab, like Process Instance Execution Time, Job Life Time, and Overall Processing Latency?

Please note that there is another problem with strange data visualization raised recently. It may be worth having a look at this issue and trying out the mentioned workaround:

Intermittent invalid bucket values in Grafana · Issue #11814 · camunda/zeebe · GitHub

jgeek1 · March 20, 2023, 10:11am

Thanks @korthout for the reply, and sorry for the delayed response. My zeebe-cluster went down due to networking issues. Now it’s up again and we re-ran the test to understand the cycle time. Below are the responses to your queries.

No changes are made to the dashboard. We took a copy of it a month back and are using the same.

Please find the screenshots below

I went through the github issue. The outcome is similar, but in my case the values (cycle time) aren’t 0, which is what the issue describes.

Kindly let me know if you have more ideas for troubleshooting this issue. It is blocking us from determining the overall process execution times when they go beyond 10 secs.

Thanks.

Latency Panel Screenshots

jgeek1 · March 23, 2023, 10:02am

@korthout - any inputs on this? We are kind of stuck without getting to know the cycle time. Any directions would be helpful.

korthout · March 23, 2023, 10:36am

Sorry about the late reply.

Considering the low Job Activation Time and the high Job Life Time, it seems that most of the time is spent outside of the workflow engine.

Also note that all other metrics (like Overall Processing Latency, Record Processing Duration, and Batch Event Replay Duration) are generally low, which indicates the same thing. The workflow engine seems healthy and responsive. Having said that, some higher spikes in overall processing latency could be worrying. You may try to increase resources and see if that lowers this metric. But this should not be the main concern.

I’d argue that you should inspect your workers and infrastructure. Are you sure your workers can keep up? Perhaps they slow down when you push the system as you described.

korthout · March 23, 2023, 12:04pm

To add to this, @lenaschoenburg pointed out to me that:

There was an old version of the grafana dashboard where the calculation for average was a bit weird and would often result in nonsensical values (> 1 billion seconds just doesn’t make sense). I’d suggest that the user looks at the 50th percentile (so median and not average). The calculation is much more stable there and does not result in these weird values.

That’s also something we did for our own dashboard: fix(monitoring): use percentiles to show process instance execution time · camunda/zeebe@1496f29 · GitHub

Perhaps that helps explain the extremely high cycle time.

jgeek1 · March 23, 2023, 1:04pm

@korthout - High cycle time is expected since I am running the benchmarking tool with high taskCompletionDelay. Each task in the workflow is taking 5, 10, 20 secs during each run.

My challenge is I am not able to find what is the avg cycle time of the process? The current dashboard isn’t giving right results. Any idea on this?

Thanks.

korthout · March 23, 2023, 1:08pm

My challenge is I am not able to find what is the avg cycle time of the process? The current dashboard isn’t giving right results. Any idea on this?

Yeah, like @lenaschoenburg mentioned, the average calculation is sometimes faulty. Exactly why that is the case is unclear to me, but it’s much better to look at the 50th percentile than the average. That calculation is much more stable and also provides better insight, as the average is often skewed by a few extreme cases. If you specifically want to know about those extreme cases, you can look at the 90th or the 99th percentile with that same calculation.
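To illustrate why the median is the more robust number, here is a small sketch on made-up data (not taken from the dashboard): 98 instances that take 60 secs and 2 bogus samples of 1.68 billion secs. The `percentile` helper is a simple nearest-rank implementation written for this example, not the calculation Grafana or Prometheus performs.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up durations in seconds: mostly 60 s, plus two nonsensical outliers.
durations = [60.0] * 98 + [1.68e9] * 2

print("mean  :", statistics.mean(durations))   # dominated by the two outliers
print("median:", percentile(durations, 50))    # 60.0, unaffected by them
print("p90   :", percentile(durations, 90))    # still 60.0
print("p99   :", percentile(durations, 99))    # this is where the outliers show up
```

Two bad samples out of a hundred are enough to push the mean into the tens of millions of seconds, while the 50th and 90th percentiles stay at 60 secs.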

Does that help?

jgeek1 · March 23, 2023, 1:29pm

Ok so it looks like I have an old grafana dashboard which has an avg panel. I need to take an update and check, right?

korthout · March 23, 2023, 1:34pm

Yes, you can also try it out directly, by editing the panel to this, but updating the dashboard entirely might be the best solution as it would also bring in any other latest changes the team has made to the dashboards.

jgeek1 · March 24, 2023, 9:06am

@korthout - I took the latest zeebe dashboard and reran the test. The process execution time panel shows a different outcome than earlier but it still seems incorrect (screenshot below).

The taskCompletionDelay is 5 secs and we have 12 tasks in our process. So the ideal cycle time should be around 60 secs but the graph below shows it as 10 secs.

Any idea what could be going wrong?

Thanks.

Zelldon · March 24, 2023, 10:31am

Hey @jgeek1

the last bucket in the histogram is 10 seconds, which means everything greater than 10 seconds ends up in the last bucket. This is the reason why it looks like this.
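The effect can be reproduced with a simplified Python re-implementation of how a Prometheus-style histogram quantile is estimated. This is a sketch, not the dashboard's actual query: the bucket bounds below are hypothetical, and only the 10 second upper limit is taken from this thread. When the requested quantile falls into the overflow (+Inf) bucket, Prometheus reports the highest finite bucket bound, which is why the panel flattens out at 10 secs.

```python
import math


def to_cumulative_buckets(observations, bounds):
    """Count how many observations are <= each upper bound (cumulative, like `le` buckets)."""
    return [(b, sum(1 for o in observations if o <= b)) for b in bounds]


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs ending with +Inf."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies in the overflow bucket: report the highest finite bound.
                return buckets[i - 1][0]
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical bounds in seconds; only the 10 s limit comes from the thread.
BOUNDS = [0.5, 1.0, 2.5, 5.0, 10.0, math.inf]

# 100 instances that really took about 60 s each (12 tasks x 5 s).
slow = to_cumulative_buckets([60.0] * 100, BOUNDS)
print(histogram_quantile(0.5, slow))  # capped at the last finite bound, 10.0
```

With these inputs the true median is 60 secs, but the estimate cannot exceed 10 secs because the histogram holds no information above that bound.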

Greets
Chris

jgeek1 · March 24, 2023, 10:38am

Right @Zelldon. Can the dashboard be tweaked to increase the buckets so the cycle time can be calculated?

It is pretty normal to have processes that take more than 10 secs to execute, so it would be important to have the grafana dashboard display that in some way.

Thanks.

Zelldon · March 24, 2023, 10:44am

Hey @jgeek1

no, this is on the application level. Right now there is no way to change the bucket sizes, but you can write a different metric exporter and deploy that one, because all of the magic regarding process execution latency is done here: zeebe/MetricsExporter.java at main · camunda/zeebe · GitHub

Hope that helps.

Greets
Chris

jgeek1 · March 24, 2023, 11:02am

Ok. Browsing through the other dashboards in the github repo I came across this benchmark dashboard. It has “Process Instance Execution Time (avg)” but no values are getting displayed (screenshots below). Not sure what is missing. Any idea?

Another option I thought of exploring was to see if this benchmarking tool dashboard could be used, but it seems it’s not kubernetes compatible. This also seems to be leading towards a dead end.

Zelldon · March 24, 2023, 3:11pm

The dashboard you are trying is for our weekly medic benchmarks, so internal use. I doubt that this brings you any value.