Very high activate job latency while load testing Zeebe

Alexey Vinogradov: Hello Team!
We have a problem: when we run load tests we see a very high Activate Job Latency (gRPC) in our dashboard. Does this mean we have insufficient resources for our workers or gateways? We are running both with mostly default settings, so what should we check to narrow this down?

zell: Hey @Alexey Vinogradov, can I ask you for more details like partition count, how many workers, and what kind of load? How many instances are running/created, etc.?

It is expected to have high activation latency if the processing queue is too big. Example: say you have only one or two partitions and create a lot of instances per second (~200 PI/s), and maybe also have a lot of different task types or workers, so you try to activate jobs at the same time. Each of these client actions is a command that ends up on the partition log. A response for a command is only sent once the processor reaches the corresponding command on the log and has processed it. So the more work you add to the log, the longer each command takes to complete. Does this make sense?

What kind of numbers do you want to reach? Maybe I can give some hints regarding scaling the partitions, etc.

Alexey Vinogradov: Hey <@U6WCLLNGJ>, sorry for the late reply. I guess we just need more partitions :thumbsup: We will try this and come back later in case this doesn’t work for us.
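(For reference: the partition count is a broker-level setting that, in the Zeebe versions current at the time of this conversation, has to be fixed when the cluster is first created. Below is a minimal sketch of the relevant broker application.yaml fragment; the values are placeholders, not a sizing recommendation.)

```yaml
# Sketch of a broker application.yaml fragment; values are placeholders.
zeebe:
  broker:
    cluster:
      clusterSize: 3          # number of brokers in the cluster
      partitionsCount: 8      # more partitions = more stream processors working in parallel
      replicationFactor: 3    # copies of each partition spread across the brokers
```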

Alexey Vinogradov: Hey, in our case the problem was the thread count for processing. We increased the number of threads and that works for us.
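(The processing thread count mentioned here is also a broker setting. A minimal sketch, assuming the broker's application.yaml; the same values can usually be supplied as environment variables of the form ZEEBE_BROKER_THREADS_*.)

```yaml
# Sketch: raise the broker's processing and IO thread pools.
# Size them against the CPUs actually available to the broker process/pod.
zeebe:
  broker:
    threads:
      cpuThreadCount: 4   # threads used for stream processing
      ioThreadCount: 4    # threads used for log and state IO
```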

zell: Thanks @Alexey Vinogradov for mentioning this. I created https://github.com/camunda-cloud/zeebe/issues/8741 maybe you have input on this. :slightly_smiling_face:

Alexey Vinogradov: Also, what do you think about some troubleshooting guides? They would be short, imperative instructions for common problems (without explanations, but with links to them). For example:

  1. Topic: My cluster does not work as fast as I expected
  2. How to fix:
     a. Enable monitoring (see the sketch after this list):
        i. Enable metrics on the Zeebe brokers and gateways
        ii. Deploy Prometheus to collect them
        iii. Deploy Grafana with the Zeebe dashboard template
     b. Observe the metrics:
        i. If the system metrics are high, try to add more resources
        ii. Otherwise, look at the processing metrics…
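(As a sketch for the "enable monitoring" step above: Zeebe brokers, and typically standalone gateways as well, expose Prometheus metrics on the monitoring port, 9600 by default; the path is /metrics on older releases and /actuator/prometheus on newer ones. A minimal Prometheus scrape config with hypothetical host names:)

```yaml
# Minimal Prometheus scrape config; host names are placeholders.
scrape_configs:
  - job_name: zeebe
    metrics_path: /metrics        # or /actuator/prometheus, depending on the Zeebe version
    static_configs:
      - targets:
          - zeebe-broker-0:9600
          - zeebe-broker-1:9600
          - zeebe-gateway:9600
```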

zell: Yeah, exactly, this is something we have discussed several times internally. Unfortunately we haven't had time for it yet, but it would be awesome if we could provide something like that. cc @Amara Graham @Felix Müller (Camunda) <@UAHNVBW8N>

Alexey Vinogradov: Cool :thumbsup:. I can help you with this, because we definitely want to document such steps internally, and it would be cool if the community added such guides too (I mean, we might miss some tools or something like that, but for the community it could be a handy tool to easily detect or solve a problem).

zell: Sounds great! Ideally we would have this as part of our documentation, but just as an idea, maybe it makes sense to start this as a community project and extend it when we have more time, e.g. here https://github.com/camunda-community-hub, and then migrate it into the docs later :man-shrugging:

Alexey Vinogradov: Sounds interesting! We will think about this

Note: This post was generated by Slack Archivist from a conversation in the Zeebe Slack, a source of valuable discussions on Zeebe (get an invite). Someone in the Slack thought this was worth sharing!

If this post answered a question for you, hit the Like button - we use that to assess which posts to put into docs.

Nicolas: As Zell mentioned, we don’t have time to write in-depth guides, but we’d be happy to help devrel or others do that :slightly_smiling_face:

Thomas Heinrichs: Hey @Alexey Vinogradov - I like your idea about some troubleshooting guides. I already added it to our backlog. But it will take some time to get this done for sure :smile:.
Thanks a lot! :slightly_smiling_face:

zell: To get the ball rolling, we could already create a repo where people can create issues or PRs to add content :slightly_smiling_face:

Thomas Heinrichs: Awesome idea! I will create one in the community-hub :slightly_smiling_face:
What name do you prefer “ccsm-troubleshooting-guide” or “zeebe-troubleshooting-guide”?

zell: I think the first one so we can also cover other topics :slightly_smiling_face:

Amara Graham: I really like the approach here! We have some topics in the backlog around troubleshooting, but no guides in the works. My recommendation is to write the guide in plain markdown, include any code snippets in that repo as well, and then bring it to my attention so we can incorporate it into the official docs at some point.

Thomas Heinrichs: Maybe we could route viewers of this repository, starting from the readme, to subfolders containing solutions to their problem. I am imagining something like a decision tree. :slightly_smiling_face:
Btw. an empty project has been created: https://github.com/camunda-community-hub/ccsm-troubleshooting-guide
I will add a readme in a minute. Feel free to start creating issues. (I will post it to general once the readme is ready)

Thomas Heinrichs: Readme updated :white_check_mark:
@Alexey Vinogradov I added your problem as an example issue. Feel free to add more! :slightly_smiling_face: https://github.com/camunda-community-hub/ccsm-troubleshooting-guide/issues/1