Sarath Kumar: Hi Team,
I’m running a Zeebe cluster with 3 brokers and getting this error, “io.zeebe.client.api.command.ClientStatusException: Reached maximum capacity of requests handled”, while doing a load test with 200 requests in parallel. Kindly help me find the issue.
Josh Wulf: What’s the issue? You want it to do more?
Sarath Kumar: Yes, we have added 1000 workflows, and 200 instances per second are getting added randomly across these workflows.
This was running fine as long as there was no transition of instances between workflows.
Sarath Kumar: I have a use case where instance 1, while it is processing, will add an instance to workflow 2. While doing this from the worker, we are getting the mentioned exception
Sarath Kumar: yes,
Sarath Kumar:
// Create a new instance of the target workflow from inside the worker
WorkflowInstanceEvent workflowInstanceEvent = this.zeebeClient.newCreateInstanceCommand()
        .bpmnProcessId(bpmnProcessId).latestVersion().variables(variables).send().join();
return workflowInstanceEvent.getWorkflowInstanceKey();
}
Josh Wulf: So, are you saying that “a Worker for a Service Task in Workflow 1 creates an instance of Workflow 2”
Sarath Kumar: yes, correct
Josh Wulf: And how are Workflow 1 instances created?
Sarath Kumar: They are created via separate service
Josh Wulf: So for each workflow 1, you create one instance of workflow 2?
Sarath Kumar: It depends on the use case. In some cases it will end in workflow 1 itself, but for other use cases it will create instances in any of the other desired workflows.
Sarath Kumar: We are seeing this exception randomly while creating the new instance from the worker
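Note: for readers, this is the pattern being described: a worker for a Service Task in Workflow 1 creates an instance of Workflow 2 through the Zeebe Java client. A minimal sketch follows; the gateway address, the job type “spawn-workflow-2” and the BPMN process id “workflow-2” are placeholders, not values from this conversation.

import io.zeebe.client.ZeebeClient;

public class SpawnWorkflow2Worker {
  public static void main(String[] args) {
    final ZeebeClient client = ZeebeClient.newClientBuilder()
        .brokerContactPoint("127.0.0.1:26500") // placeholder gateway address
        .usePlaintext()
        .build();

    client.newWorker()
        .jobType("spawn-workflow-2") // placeholder job type of the Service Task in Workflow 1
        .handler((jobClient, job) -> {
          // Create an instance of Workflow 2, passing the current job's variables along
          client.newCreateInstanceCommand()
              .bpmnProcessId("workflow-2") // placeholder BPMN process id
              .latestVersion()
              .variables(job.getVariablesAsMap())
              .send()
              .join();
          // Complete the Workflow 1 job only after Workflow 2 has been created
          jobClient.newCompleteCommand(job.getKey()).send().join();
        })
        .open(); // the worker polls in the background; keep the JVM alive to let it run
  }
}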
Josh Wulf: So I hear two different things
Josh Wulf: One is an error message in the worker, appearing randomly, saying that the maximum number of parallel requests is already in flight
Josh Wulf: The other is that workflows no longer progress on the brokers
Josh Wulf: Is that correct?
Josh Wulf: Or, by “there is no transition of instances between workflows”, do you mean that workers can no longer create new instances of workflows?
Sarath Kumar: yes. I have added an instance to workflow 1, and workflow 1 is trying to add a new instance to workflow 2. While doing this at a rate of 200 req/sec, we are seeing this error: “io.zeebe.client.api.command.ClientStatusException: Reached maximum capacity of requests handled”
As the worker is throwing an error, the workflow instance moves to a failed state
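Note: “Reached maximum capacity of requests handled” is the broker telling the client it is temporarily saturated, so one option is to retry the create-instance call with a short backoff inside the worker rather than letting the handler throw and the job run out of retries. A rough sketch, assuming a zeebeClient field is in scope; the method name, attempt limit and backoff values are made up for illustration:

// Retry the create-instance call a few times before giving up, treating the
// rejection as transient broker back-pressure rather than a permanent failure.
long createInstanceWithRetry(String bpmnProcessId, Map<String, Object> variables)
    throws InterruptedException {
  int attempts = 0;
  while (true) {
    try {
      return zeebeClient.newCreateInstanceCommand()
          .bpmnProcessId(bpmnProcessId).latestVersion().variables(variables)
          .send().join()
          .getWorkflowInstanceKey();
    } catch (io.zeebe.client.api.command.ClientStatusException e) {
      if (++attempts >= 5) {
        throw e; // still rejected after 5 attempts: let the job fail and be retried later
      }
      Thread.sleep(200L * attempts); // simple linear backoff before the next attempt
    }
  }
}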
Josh Wulf: Uh-huh
Josh Wulf: 3 brokers. One gateway? How many workers? What resources do the brokers have? RAM, CPU?
Josh Wulf: Probably the gateway is saturated
Sarath Kumar: yes, 3 brokers with 1 gateway. Let me pull the resources
Josh Wulf: Also, what does zbctl status
show you?
Josh Wulf: And what is in the Gateway logs?
Josh Wulf: That message is emitted here: https://github.com/zeebe-io/zeebe/blob/a417a8fee9c93856420ed59ab10f2517a925a081/broker/src/main/java/io/zeebe/broker/transport/commandapi/CommandApiRequestHandler.java#L140
Josh Wulf: That message will tell you in the logs which partition is affected:
Josh Wulf: "Partition-{} receiving too many requests. Current limit {} inflight {}, dropping request {} from gateway"
Sarath Kumar: yeah sure, let me pull the logs from gateway
Josh Wulf: @UEHR7VBBP do you need to run at a log level other than INFO to see that message in the logs?
Sarath Kumar: 2020-02-04 06:28:20.803 [] [main] INFO io.zeebe.gateway - Starting gateway with configuration {
  "network": {
    "host": "10.177.71.93",
    "port": 26500,
    "minKeepAliveInterval": "30s"
  },
  "cluster": {
    "contactPoint": "10.177.71.226:26502",
    "maxMessageSize": "4M",
    "requestTimeout": "15s",
    "clusterName": "zeebe-cluster",
    "memberId": "gateway",
    "host": "0.0.0.0",
    "port": 26502
  },
  "threads": {
    "managementThreads": 1
  },
  "monitoring": {
    "enabled": false,
    "host": "0.0.0.0",
    "port": 9600
  },
  "security": {
    "enabled": false
  }
}
Sarath Kumar: Above is everything available in the log. Not able to see that partition error message
Josh Wulf: and zbctl status?
Sarath Kumar:
Cluster size: 3
Partitions count: 16
Replication factor: 3
Brokers:
  Broker 0 - 10.177.71.226:26501
    Partition 1 : Leader
    Partition 2 : Leader
    Partition 3 : Follower
    Partition 4 : Leader
    Partition 5 : Leader
    Partition 6 : Follower
    Partition 7 : Leader
    Partition 8 : Leader
    Partition 9 : Follower
    Partition 10 : Leader
    Partition 11 : Leader
    Partition 12 : Follower
    Partition 13 : Leader
    Partition 14 : Leader
    Partition 15 : Follower
    Partition 16 : Leader
  Broker 1 - 10.177.71.39:26501
    Partition 1 : Follower
    Partition 2 : Follower
    Partition 3 : Follower
    Partition 4 : Follower
    Partition 5 : Follower
    Partition 6 : Follower
    Partition 7 : Follower
    Partition 8 : Follower
    Partition 9 : Follower
    Partition 10 : Follower
    Partition 11 : Follower
    Partition 12 : Follower
    Partition 13 : Follower
    Partition 14 : Follower
    Partition 15 : Follower
    Partition 16 : Follower
  Broker 2 - 10.177.71.131:26501
    Partition 1 : Follower
    Partition 2 : Follower
    Partition 3 : Leader
    Partition 4 : Follower
    Partition 5 : Follower
    Partition 6 : Leader
    Partition 7 : Follower
    Partition 8 : Follower
    Partition 9 : Leader
    Partition 10 : Follower
    Partition 11 : Follower
    Partition 12 : Leader
    Partition 13 : Follower
    Partition 14 : Follower
    Partition 15 : Leader
    Partition 16 : Follower
Josh Wulf: Too many partitions
Josh Wulf: With three brokers you want to have three partitions
Josh Wulf: More partitions == more work
Josh Wulf: Which is cool if you have more brokers, because then they can each get some
Josh Wulf: Broker 1 is not doing any processing
Josh Wulf: two of the nodes are doing all the work
Josh Wulf: and they are having to manage 16 partitions while they do it
Josh Wulf: too much overhead
Josh Wulf: I would start there: 3 partitions, try it again, see where it fails
Josh Wulf: then restart the brokers in TRACE log level and do it again, and study the logs
Josh Wulf: and turn on Prometheus metrics and start building graphs to see where the resource pressure is on each machine and in the system
Sarath Kumar: Just for clarification, will the partition count affect the request count?
Josh Wulf: Also, this: https://zeebe.io/blog/2019/12/zeebe-performance-profiling/
Josh Wulf: It will affect the amount of meta-work that your brokers are doing
Josh Wulf: managing a partition is not servicing requests, so they become resource starved
Sarath Kumar: Yeah sure, I will reduce the partitions, but this will reduce the processing speed, right?
We have already done the Prometheus setup. If possible, share with me the metrics to use to measure resource pressure
Josh Wulf: How much? I don’t know. Measure it and let me know - I’m interested to know
Josh Wulf: How do more partitions increase the processing speed?
Josh Wulf: They do if they are spread across more brokers
Josh Wulf: If you had 16 partitions and 16 brokers then of course it will be faster
Josh Wulf: But if you have 3 men, how does giving them 16 jobs get your house built faster?
Josh Wulf: You need 16 people doing 16 jobs for that to be effective
Sarath Kumar: oh okay sure, will try this configuration.
And what’s the maximum parallel request count the gateway can handle?
Josh Wulf: At the moment you have two people doing 16 jobs, and one - Broker 1, watching them
Josh Wulf: It depends
Josh Wulf: Put it on a Raspberry Pi like @USX4VT2AC and you’ll get one metric
Josh Wulf: Put it on a 16-CPU machine with 256GB RAM and you’ll get another one
Josh Wulf: The only way to know is to measure and adjust parameters
Josh Wulf: That’s what you are doing now - science
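Note: because the sustainable rate depends on the hardware, one client-side knob is to cap how many create-instance requests are in flight at once and tune that cap against your measurements. A minimal sketch using java.util.concurrent.Semaphore; the limit of 50 is an arbitrary starting point and the method name is illustrative:

// Cap the number of create-instance requests this client keeps in flight at once.
private final java.util.concurrent.Semaphore inFlight = new java.util.concurrent.Semaphore(50);

long createInstanceBounded(String bpmnProcessId, Map<String, Object> variables)
    throws InterruptedException {
  inFlight.acquire(); // block until one of the 50 slots is free
  try {
    return zeebeClient.newCreateInstanceCommand()
        .bpmnProcessId(bpmnProcessId).latestVersion().variables(variables)
        .send().join()
        .getWorkflowInstanceKey();
  } finally {
    inFlight.release(); // free the slot whether the call succeeded or failed
  }
}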
Sarath Kumar: yeah sure, we will give this configuration a try
Josh Wulf: devise experiments, take notes, make hypotheses
Josh Wulf: That’s how Mendel discovered genetic inheritance. Careful observation and experimentation
Josh Wulf: And read that blog post - it is exactly about how to do this - performance profiling
Sarath Kumar: Yup sure !!
Josh Wulf: I would run it with partition counts from 3 to 5, replication 0, 2, and 3
Josh Wulf: That gives you a matrix
Josh Wulf: That’s nine combinations
Josh Wulf: Measure throughput, end-to-end latency and mean workflows to failure
Josh Wulf: Then do the same thing again with different sized instances
Josh Wulf: I’d automate that, because if you try two different instance sizes, that’s already 18 runs
Josh Wulf: and if you automate it, you can just as easily do three sizes for 27 runs
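Note: the matrix above is easy to enumerate in a small harness so every combination is provisioned and measured the same way. A sketch of just the enumeration; the instance sizes are placeholders and the provisioning / load-test step is left as a comment:

// Enumerate the suggested test matrix: partition count x replication factor x instance size.
int[] partitionCounts = {3, 4, 5};
int[] replicationFactors = {0, 2, 3};
String[] instanceSizes = {"small", "medium", "large"};

for (int partitions : partitionCounts) {
  for (int replication : replicationFactors) {
    for (String size : instanceSizes) {
      // Provision the cluster with this configuration, run the load test,
      // then record throughput, end-to-end latency, and failures for this run.
      System.out.printf("run: partitions=%d replication=%d instances=%s%n",
          partitions, replication, size);
    }
  }
}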
Josh Wulf: And then when a new version comes out, before you deploy it, you run the same test suite again, and BAM! You’re a pro
Josh Wulf: Then you give a talk at a conference and get hired by IBM
Josh Wulf: I saw two guys talk at http://DevConf.CZ last week
Josh Wulf: They spent six months profiling databases on Cloud providers
Josh Wulf: https://twitter.com/sitapati/status/1221389377340956672
Sarath Kumar: oh wow, well motivated to complete this
Josh Wulf: https://twitter.com/sitapati/status/1221399049233928193
Josh Wulf: You got this
Josh Wulf: @UTM6C2C3H How do I performance profile Zeebe for my use case?
Note: This post was generated by Slack Archivist from a conversation in the Zeebe Slack, a source of valuable discussions on Zeebe (get an invite). Someone in the Slack thought this was worth sharing!
If this post answered a question for you, hit the Like button - we use that to assess which posts to put into docs.