Alexey Vinogradov: Hello Team!
We have a problem: when we run some load testing, we see a very high Activate Job Latency (gRPC) in our dashboard. Does this mean that we have insufficient resources for our workers or gateways? We are running both workers and gateways with mostly default settings, so what should we check to clarify this?
zell: Hey @Alexey Vinogradov, can I ask you for more details: partition count, how many workers, what kind of load? How many instances are running/created, etc.?
It is expected to have high activation latency if the processing queue is too big. For example: say you have only one or two partitions and create a lot of instances per second (~200 PI/s), and maybe you also have a lot of different task types or workers, so you try to activate jobs at the same time. Each of these client actions is a command which ends up on the partition log. A response for a command is only sent once the processor reaches that command on the log and has processed it. So the more work you add to the log, the longer each command takes to complete. Does this make sense?
What kind of numbers do you want to reach? Maybe I can give some hints regarding scaling the partitions etc.
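To make the scenario above concrete, here is a minimal sketch using the Zeebe Java client. The gateway address, process ID, and job type are placeholder values, and the loop only stands in for a real load generator; the point is that the create-instance commands and the worker's job activations land on the same partition logs and therefore queue up behind each other:

```java
import io.camunda.zeebe.client.ZeebeClient;

public class LoadSketch {
  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500") // placeholder address
        .usePlaintext()
        .build()) {

      // Load generator: every CreateProcessInstance call becomes a
      // command appended to one of the partition logs.
      for (int i = 0; i < 200; i++) {
        client.newCreateInstanceCommand()
            .bpmnProcessId("demo-process") // placeholder process ID
            .latestVersion()
            .send();
      }

      // The worker's job activations are commands on the same partition
      // logs, so with few partitions they wait behind the creates above.
      client.newWorker()
          .jobType("demo-task") // placeholder job type
          .handler((jobClient, job) ->
              jobClient.newCompleteCommand(job.getKey()).send())
          .open();

      // Keep the worker alive for the duration of the sketch.
      Thread.sleep(10_000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```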
Alexey Vinogradov: Hey <@U6WCLLNGJ>, sorry for the late reply. I guess we just need more partitions. We will try this and come back later in case it doesn't work for us.
Alexey Vinogradov: Hey, in our case the problem was the thread count for processing. We increased the number of threads and that works for us.
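For reference, both knobs discussed in this thread live in the broker configuration. A minimal sketch of the relevant section of the Zeebe broker's application.yaml, with purely illustrative values (the right numbers depend on your hardware and load; the same settings can also be set via the ZEEBE_BROKER_CLUSTER_PARTITIONSCOUNT and ZEEBE_BROKER_THREADS_CPUTHREADCOUNT environment variables):

```yaml
zeebe:
  broker:
    cluster:
      # More partitions spread processing across threads and brokers;
      # note that the partition count cannot be changed for an
      # existing cluster.
      partitionsCount: 8
    threads:
      # Threads used for stream processing (the fix in this thread).
      cpuThreadCount: 4
      # Threads used for log flushing and other IO.
      ioThreadCount: 4
```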
zell: Thanks @Alexey Vinogradov for mentioning this. I created https://github.com/camunda-cloud/zeebe/issues/8741 maybe you have input on this.
Alexey Vinogradov: Also, what do you think about some troubleshooting guides? They should be short, imperative instructions about some problems (without explanations but with links to them). For example:
- Topic: My cluster works not so fast as I expected
- How to fix:
a. Enable monitoring:
i. Enable metrics on Zeebe Brokers and Gateways
ii. Deploy Prometheus to collect them
iii. Deploy Grafana with Zeebe template
b. Observe metrics:
i. If the system metrics are high try to add more resources
ii. Else:
1. Look for the processing metrics…
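As a sketch of step (a), a Prometheus scrape config for Zeebe could look like the following. The hostnames are placeholders, and the metrics path and port depend on your deployment and Zeebe version (brokers and standalone gateways expose monitoring on port 9600 by default):

```yaml
scrape_configs:
  - job_name: zeebe
    scrape_interval: 15s
    metrics_path: /metrics       # some versions expose /actuator/prometheus instead
    static_configs:
      - targets:
          - zeebe-broker-0:9600  # placeholder hostnames
          - zeebe-broker-1:9600
          - zeebe-gateway:9600
```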
zell: Yeah, exactly, this is something we have discussed several times internally. Unfortunately we have had no time for it yet, but it would be awesome if we could provide something like that. \cc @Amara Graham @Felix Müller (Camunda) <@UAHNVBW8N>
Alexey Vinogradov: Cool. I can help you with this, because we definitely want to document such steps internally, and it would be cool if the community added such guides too (I mean, we might miss some tools or something like that, but the community could be a handy resource for easily detecting or solving such problems).
zell: Sounds great! Ideally we would have this as part of our documentation, but just as an idea, maybe it makes sense to start this as a community project and extend it when we have more time. Like here: https://github.com/camunda-community-hub, and then migrate it into the docs later :man-shrugging:
Alexey Vinogradov: Sounds interesting! We will think about this
Note: This post was generated by Slack Archivist from a conversation in the Zeebe Slack, a source of valuable discussions on Zeebe (get an invite). Someone in the Slack thought this was worth sharing!