Zeebe is unreasonably slow with out-of-the-box setup

Zelldon · November 25, 2022, 3:01pm

Hey,

thank you both for your answers and all the information you provided! I have some follow-up questions.

Your broker config shows:

    "partitionIds" : [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ],
    "nodeId" : 0,
    "partitionsCount" : 8,

I’m wondering how this can happen. Did you change the configuration in between?
Regarding the replication factor of 2, what was the reason for picking this count? I suggest increasing it to at least three. See also other posts regarding this:

Can you also share the configuration from the standalone gateway?
I would also be interested in the Helm values file, if you have one.
Are you planning to migrate to the most recent version in the near future? We have already released 8.1.x (Zeebe 8.1.4, spring-zeebe 8.1.6).

> Cluster deployed to k8s on bare metal (VMs in the local data center)

Do you have insights into the disk performance (IOPS)?

> Regarding metrics - we have metrics collected by Prometheus and showing in Grafana. Actually, there are dozens of them and I cannot see any signs of slow processing. We can share any dashboards if it helps.

This would be really interesting. Are you using the Zeebe Dashboards? If so, could you share the panels under the latency section? Under “Processing”, I would like to see the “Processing Queue Size”. A screenshot of the General View would be nice as well.

> To test performance we’re using JMeter 5.5, so by 5 users, I meant 5 threads.

Ok, this is interesting and something we have to take a deeper look at, I think.

> Next we “release” this entity that triggers Zeebe process of a few workers, checks that it has passed, and launch the same process again, but with different values (similar to PUT in the REST world).

As far as I understand, you create a process instance and verify in this same thread whether it has completed. Where are the workers running? In a different thread or on a different machine? Sorry if I overlooked that.

> but I think I can provide an anonymized version of it

Thanks, that is great! Sharing the model helps.

According to the model, you have in the worst case 14 tasks to complete, plus 6 sub-processes, conditional gateways, etc., which of course also take processing time. But as far as I understand your comment, your test scenario only has three tasks to complete. Is this right?

> Since we receive 1 request for processing approximately every second and processing takes 3-5 seconds I think we can estimate that we run 3-5 processes concurrently.

Thanks, that helps.
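
As a quick sanity check on that estimate: it is the usual arrival rate times time-in-system rule of thumb (Little's law). A minimal sketch, with the class and method names made up for illustration and the numbers taken from your description:

```java
public class ConcurrencyEstimate {

    // Little's law: average number of instances in flight =
    // arrival rate (per second) * average time each instance takes (seconds).
    static double inFlight(double arrivalsPerSecond, double processingSeconds) {
        return arrivalsPerSecond * processingSeconds;
    }

    public static void main(String[] args) {
        // 1 request per second, 3 to 5 seconds per process instance
        System.out.println(inFlight(1.0, 3.0)); // prints 3.0
        System.out.println(inFlight(1.0, 5.0)); // prints 5.0
    }
}
```

So 3-5 concurrent instances is consistent with the numbers you gave, as long as the arrival rate stays around one per second.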

Can you share in more detail what you’re doing in the sequence flow conditions? Any big computation with FEEL expressions? I can see this in your example:

    some op in field_4 satisfies op = "my_condition_4"

Not sure whether this is what you’re actually doing right now.

Since you are in the benchmarking stage right now, would it be possible to adjust the test/process to narrow down where the issue lies, for example by removing or simplifying the conditions? It would be interesting to me to understand whether this might be an issue with the variables.

> just before sending a task to Zeebe metric timer is started and it’s stopped on response callback (method whenComplete() in java code)

Just for clarification: you send CreateProcessInstance with await result to the broker, start the timer before sending, and stop it when the result arrives, correct?
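
To make sure we are talking about the same measurement, here is how I picture it. This is only a sketch using plain JDK types: `timeAsync` and `fakeCall` are hypothetical names, and `fakeCall` merely stands in for your create-instance-with-result call, whose future I assume you hook into with `whenComplete()` in the same way.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class RoundTripTimer {

    // Starts the clock, triggers the async call, and completes with the
    // elapsed nanoseconds once the call finishes (successfully or with an error).
    static <T> CompletableFuture<Long> timeAsync(Supplier<CompletionStage<T>> call) {
        long start = System.nanoTime();
        CompletableFuture<Long> elapsed = new CompletableFuture<>();
        call.get().whenComplete((result, error) -> elapsed.complete(System.nanoTime() - start));
        return elapsed;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for sending the command and awaiting the result.
        Supplier<CompletionStage<String>> fakeCall = () -> CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(50); // simulated round trip
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "done";
        });

        long nanos = timeAsync(fakeCall).get(5, TimeUnit.SECONDS);
        System.out.println("round trip ms: " + TimeUnit.NANOSECONDS.toMillis(nanos));
    }
}
```

If that matches your setup, the measured time covers the whole round trip: gateway, processing on the broker, all job activations and completions by the workers, and the response back to the client.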

> For workers, you can see timing (please ignore 0.99 and 0.999 outliers, that seems like a DB timeout, not related to Zeebe)
> How it was measured: timer is started as soon as possible in @ZeebeWorker method and stopped in the final block (basically on return to broker).

Ok, this is measuring your client code, correct? So your workers take 0.5 seconds? Or what unit is this?

> Initial JSON contains about 30 fields on different nestedness levels and takes ~1.5KB. In later phases, it may grow. I don’t have exact numbers but a rough estimate is about 50-70 fields (or about 200 lines of formatted JSON) that take 7-10KB

Thanks. Ok, please be aware that this might affect the processing time, the activation time, and potentially also the evaluation time of expressions. Did you try to filter the variables for your workers? This might help with job activation, and it is useful when a worker doesn’t want or doesn’t need to see all variables.
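
To illustrate what I mean by filtering: the Java client's job worker builder has a `fetchVariables` option, and as far as I know the `@ZeebeWorker` annotation in spring-zeebe has a `fetchVariables` attribute as well, so a worker only receives the variables it names. The sketch below is not the client API, just a plain-JDK model of the effect, and the names in it (`fetchOnly`, `largeDocument`, `orderId`) are made up:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VariableFilterSketch {

    // Returns only the requested variables, mimicking a worker that asks
    // for a subset instead of the full process instance payload.
    static Map<String, Object> fetchOnly(Map<String, Object> all, List<String> names) {
        Map<String, Object> subset = new LinkedHashMap<>();
        for (String name : names) {
            if (all.containsKey(name)) {
                subset.put(name, all.get(name));
            }
        }
        return subset;
    }

    public static void main(String[] args) {
        Map<String, Object> processVariables = new LinkedHashMap<>();
        processVariables.put("field_4", List.of("my_condition_4"));
        processVariables.put("largeDocument", "x".repeat(10_000)); // stands in for the 7-10KB JSON
        processVariables.put("orderId", 42);

        // The worker only needs orderId, so only that is handed over.
        Map<String, Object> forWorker = fetchOnly(processVariables, List.of("orderId"));
        System.out.println(forWorker.keySet()); // prints [orderId]
    }
}
```

With a 7-10KB document attached to every instance, each job activation otherwise ships the full set of variables to the worker, whether it uses them or not.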

Looking forward to your reply.

Greets
Chris