Memory profile for zeebe brokers

LarsMadMan · September 29, 2020, 7:29am

Broker version: 0.24.3
We run Zeebe on Kubernetes with 3 brokers and a gateway (Helm chart), using the Elasticsearch exporter.

Prior to 0.24.3 we experienced a lot of problems with our Zeebe setup, where brokers would restart and not come back for a long time (days). We believe this was related to "Restart takes too long or never completes when snapshot contains many files" (Issue #5135, camunda/zeebe on GitHub).

We think the restarts were caused by Kubernetes deleting the pods because of memory usage (4 GiB limit). On restart the brokers then hit the issue above, leaving the system down for all practical purposes. This has been reported in the Zeebe Slack channel earlier.

With 0.24.3, we saw restarts being quick, but we don’t necessarily know how the system will respond once the pods get restarted by kubernetes - which they will be soon due to memory usage. Here is the grafana output for one of the brokers:

Some snapshots from the cluster: Current resource usage:

Uptime:

Currently we are running one workflow, with approximately 500-1000 instances per 24 hours.

The issue looks similar to: https://forum.zeebe.io/t/zeebe-broker-0-20-0-memory-usage-is-too-high/704
However, given its age and severity, I would have hoped it had been addressed by now.

I will update the issue once the pods are restarted, and describe the behaviour we see (does it restart quickly or does it cause downtime like before)

Cheers,
Lars

LarsMadMan · September 30, 2020, 8:40am

Update 30/09

I realise the original post was a little bit difficult to interpret - what was the question posed?
The tl;dr was that we had suffered from the issue where brokers wouldn’t come online after being restarted, which was reported fixed in 0.24.3. Even after 0.24.3 we noticed that memory usage for the brokers was always climbing, and feared that once a broker was deleted due to resource policies, the same issue would occur.

So - what have we observed:

We deleted the main broker (broker-1, the one with the highest resource usage) instead of waiting for Kubernetes to kill the pod, so that we could watch the restart. The really, really REALLY great news is that the broker was back up in less than a minute and there was no downtime.

While the memory usage is certainly less than desirable, the fact that the system can restore itself quickly means the impact is negligible.

Zelldon · October 2, 2020, 11:31am

Thanks for updating @LarsMadMan

Might be related to one of these: https://github.com/zeebe-io/zeebe/issues/4812 or https://github.com/zeebe-io/zeebe/issues/3988

Greets
Chris

LarsMadMan · October 2, 2020, 11:43am

Thanks Chris - looks relevant, but hard to say for a layman. We don’t have long-running workflows; they typically run from start to finish in a few seconds. It looks like the issues are neither resolved nor currently being worked on - are there any plans to reopen them? From my point of view that would be very desirable. Should we perhaps lower the memory limits in Kubernetes so that pods restart earlier than 4 GiB? I mean to avoid performance degradation when memory usage gets very high.

Regards,
Lars

LarsMadMan · October 26, 2020, 12:07pm

Update 26/10:

@Zelldon @deepthi

Restarted one of the brokers today, to see if we could get better resource usage.

We had profiles like this for all brokers:

zeebe zeebe-cluster-zeebe-1 73m 3928Mi

Now the restarted broker (0) has been starting for over 2 hours, which looks a lot like the issue we had prior to 0.24.3 (we believed it was fixed by https://github.com/zeebe-io/zeebe/issues/5135).

zeebe-cluster-zeebe-0 0/1 Running 0 146m
zeebe-cluster-zeebe-1 1/1 Running 0 26d
zeebe-cluster-zeebe-2 1/1 Running 0 15d

current top output:
zeebe-cluster-zeebe-0 232m 1025Mi
zeebe-cluster-zeebe-1 64m 3935Mi
zeebe-cluster-zeebe-2 746m 3796Mi
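To put these readings in context against the 4 GiB limit mentioned earlier, here is a minimal Python sketch that parses `kubectl top pods`-style rows and reports each pod's memory as a percentage of the limit. The function names and the assumption that memory is always reported in `Mi` are illustrative, not anything Zeebe or kubectl provides.

```python
# Sketch only: helper names are hypothetical; assumes memory is reported in Mi.
LIMIT_MI = 4096  # the 4 GiB pod memory limit mentioned in this thread


def parse_top_line(line):
    """Return (pod, cpu_millicores, memory_mi) from one `kubectl top pods` row.

    Taking the last three columns also handles rows with a leading namespace.
    """
    pod, cpu, mem = line.split()[-3:]
    if not cpu.endswith("m") or not mem.endswith("Mi"):
        raise ValueError(f"unexpected units in row: {line!r}")
    return pod, int(cpu[:-1]), int(mem[:-2])


def memory_percent_of_limit(lines, limit_mi=LIMIT_MI):
    """Map each pod name to its memory usage as a percentage of the limit."""
    report = {}
    for line in lines:
        if not line.strip():
            continue
        pod, _cpu, mem_mi = parse_top_line(line)
        report[pod] = round(100.0 * mem_mi / limit_mi, 1)
    return report


TOP_OUTPUT = """\
zeebe-cluster-zeebe-0 232m 1025Mi
zeebe-cluster-zeebe-1 64m 3935Mi
zeebe-cluster-zeebe-2 746m 3796Mi
"""

usage = memory_percent_of_limit(TOP_OUTPUT.splitlines())
for pod, pct in usage.items():
    print(f"{pod}: {pct}% of limit")
```

With the rows above, brokers 1 and 2 come out above 90% of the limit, while the freshly restarted broker 0 sits at roughly a quarter of it.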

Last log:
2020-10-26 09:30:46.679 [] [main] INFO io.zeebe.broker.system - Bootstrap Broker-0 [1/10]: actor scheduler
2020-10-26 09:30:46.739 [] [main] INFO io.zeebe.broker.system - Bootstrap Broker-0 [2/10]: membership and replication protocol
2020-10-26 09:30:52.960 [] [main] INFO io.zeebe.broker.system - Bootstrap Broker-0 [3/10]: command api transport
2020-10-26 09:30:53.545 [] [main] INFO io.zeebe.broker.system - Bootstrap Broker-0 [4/10]: command api handler
2020-10-26 09:30:53.582 [] [main] INFO io.zeebe.broker.system - Bootstrap Broker-0 [5/10]: subscription api
2020-10-26 09:30:53.666 [] [main] INFO io.zeebe.broker.system - Bootstrap Broker-0 [6/10]: cluster services
2020-10-26 09:30:54.676 [] [http-nio-0.0.0.0-9600-exec-1] INFO org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/] - Initializing Spring DispatcherServlet ‘dispatcherServlet’
2020-10-26 09:30:54.681 [] [http-nio-0.0.0.0-9600-exec-1] INFO org.springframework.web.servlet.DispatcherServlet - Initializing Servlet ‘dispatcherServlet’
2020-10-26 09:30:54.760 [] [http-nio-0.0.0.0-9600-exec-1] INFO org.springframework.web.servlet.DispatcherServlet - Completed initialization in 78 ms

So it has logged nothing for a good while, which was the same issue we saw in earlier versions…
Any ideas to what we can look at?

Regards,
Lars

LarsMadMan · October 27, 2020, 10:41am

2020-10-26 20:49:53.328 [] [main] INFO io.zeebe.broker.system - Bootstrap Broker-0 [10/10]: zeebe partitions

So the broker came back, 11 hours later…
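For reference, the startup duration can be worked out from the two bootstrap log timestamps (step 1/10 and step 10/10). This is a small Python sketch, assuming both timestamps are on 2020-10-26 and in the same time zone; the function name is made up for illustration.

```python
from datetime import datetime

# Timestamp layout used by the broker log lines quoted in this thread.
LOG_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed(start_stamp, end_stamp):
    """Return the time between two log timestamps as a timedelta."""
    start = datetime.strptime(start_stamp, LOG_FORMAT)
    end = datetime.strptime(end_stamp, LOG_FORMAT)
    return end - start


# Step [1/10] "actor scheduler" to step [10/10] "zeebe partitions".
startup = elapsed("2020-10-26 09:30:46.679", "2020-10-26 20:49:53.328")
print(startup)
```

That works out to a little over 11 hours and 19 minutes between the first and last bootstrap step.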

LarsMadMan · October 27, 2020, 11:19am

However… after that one restart, which took 11 hours, restarting the broker again took less than a minute, and restarting the other two took only a few minutes as well. So it was almost as if the initial restart “cleared up” a lot…

Zelldon · October 29, 2020, 7:51am

Hey @LarsMadMan

sorry for the late response.

Would be interesting if you could check the file count in the snapshots of all Brokers. You mentioned you have three Brokers, but how many partitions do you have?
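One rough way to get such a file count is to walk the broker's data directory and count the files under each subdirectory. The Python sketch below does that for any directory you point it at. The demo builds a throwaway tree with made-up names; it does not reflect Zeebe's actual on-disk snapshot layout, which you would need to look up for your version.

```python
import os
import tempfile


def count_files_per_directory(root):
    """Map each immediate subdirectory of `root` to the number of files below it."""
    counts = {}
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if entry.is_dir():
            counts[entry.name] = sum(
                len(files) for _dirpath, _dirnames, files in os.walk(entry.path)
            )
    return counts


# Demo on a temporary tree; directory and file names are hypothetical.
with tempfile.TemporaryDirectory() as root:
    for name, n_files in (("snapshot-a", 3), ("snapshot-b", 1)):
        os.makedirs(os.path.join(root, name))
        for i in range(n_files):
            open(os.path.join(root, name, f"file-{i}"), "w").close()
    demo_counts = count_files_per_directory(root)

print(demo_counts)
```

Running the function against the directory that holds a broker's snapshots (inside the pod or on the mounted volume) would give one count per snapshot directory, which is easy to compare across brokers.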

Greets
Chris

Zelldon · October 29, 2020, 7:51am

Do you have metrics regarding PVC or file descriptor usage?
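If no such metrics are available, a quick spot check of open file descriptors is possible on Linux by listing `/proc/<pid>/fd`. The Python sketch below is only an illustration: the function name is made up, it needs permission to read the target process's `/proc` entry, and the broker's actual process ID inside the container is something you would have to look up.

```python
import os


def open_fd_count(pid="self"):
    """Count open file descriptors of a process via /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


# Spot check on the current process; pass a broker PID instead when run in the pod.
if os.path.isdir("/proc/self/fd"):
    print(f"open file descriptors: {open_fd_count()}")
else:
    print("/proc is not available on this platform")
```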