Long-running (months, years or maybe forever) workflows

GuSuku · January 25, 2020, 12:58pm

Hi,
I have read through the docs and the forum but could not find resources on using Zeebe to coordinate long-running workflows that can potentially run for months, years (and maybe forever). Can someone explain, or point me to resources on, the following:

Is this one of Zeebe’s core use cases?
How does Zeebe handle such workflows (freeing up resources from workflows that are waiting for triggers from clients, version migrations, etc.)?
How is Zeebe’s performance for such use cases?

Thanks!

jwulf · January 25, 2020, 2:09pm

Hi @GuSuku,

The only way to know this for sure is to run a workflow for a long period of time, and see what happens.

And since Zeebe 0.22 has been out for less than two weeks, no-one knows.

I ran workflows that lasted over a week on an earlier version of Zeebe. My observations were that memory use increased over time, and that restarting nodes in particular used far more memory when they had to reconstruct the current state.

We don’t recommend using Zeebe at this point for long-running workflows. With a database storing the engine state, as the Camunda engine does, a growing number of long-running processes just increases the db size, which is not a big deal, and each state change is a table row mutation.

Zeebe is optimised for throughput, so it uses an in-memory state, and the event log is an append-only log. This is faster and more scalable than a database; however, it is more memory-intensive.

The event log on disk is pruned as snapshots are taken and as workflows complete, but the snapshot size and the memory use increase as the number of concurrent processes on a node grows.

In my case, I reworked my processes to be shorter-lived (around 24 hours) to reduce the memory pressure. With a low number of long-lived processes YMMV, but that is not the use case that Zeebe targets. It is designed for a high volume of short-lived processes.

For long-lived processes like the ones you mention, I would lean toward the Camunda engine. It is also extremely scalable (although not to the theoretical throughput limits of Zeebe). Some users put through millions of processes per day on the Camunda engine.

GuSuku · January 26, 2020, 1:17pm

Thanks @jwulf !

I am trying to understand where the memory pressure is coming from and how fundamental it is to Zeebe’s design. Logs are deleted as soon as an event is processed and committed to a snapshot, and snapshots are flushed to disk periodically. How is this flush determined, and are snapshots pulled back into memory when needed? I can see message subscribers (~ message correlation) taking up memory. Do they overflow to disk as well? What else am I missing?

I liked Zeebe for its easy-to-get-started vibe and cloud-native design, but I may have to look into the Camunda engine.

jwulf · January 26, 2020, 3:40pm

Snapshots are generated periodically: every 15 minutes by default, configurable in zeebe.cfg.toml.

The snapshot is a projection of the current state of the broker from the event log, and serves to speed up a restart or fail-over as the new leader only needs to project the state out of the event log from the last snapshot.
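As a rough illustration of why a snapshot shortens recovery, here is a minimal Python sketch. It is not Zeebe code: the `Snapshot` class, the `recover` function, and the event tuples are all invented for this example. It only shows the general idea of starting from a snapshot and replaying the tail of the log.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    position: int                      # last log position folded into the state
    state: dict = field(default_factory=dict)


def apply(state: dict, event: tuple) -> None:
    """Fold one event into the state (the event shape is hypothetical)."""
    kind, instance_id, value = event
    if kind == "completed":
        state.pop(instance_id, None)   # finished instances leave the state
    else:
        state[instance_id] = value     # created/updated instances overwrite


def recover(snapshot: Snapshot, log: list) -> tuple:
    """Rebuild the current state by replaying only events after the snapshot."""
    state = dict(snapshot.state)
    replayed = 0
    for position, event in log:
        if position > snapshot.position:
            apply(state, event)
            replayed += 1
    return state, replayed


log = [
    (1, ("created", "wf-1", "waiting")),
    (2, ("created", "wf-2", "waiting")),
    (3, ("completed", "wf-1", None)),
    (4, ("created", "wf-3", "waiting")),
]

# Recover from a snapshot taken at position 3, and from an empty snapshot.
state, replayed = recover(Snapshot(position=3, state={"wf-2": "waiting"}), log)
full_state, full_replayed = recover(Snapshot(position=0), log)
```

Both calls arrive at the same state, but the one that starts from the snapshot replays one event instead of four.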

The snapshot is the in-memory computed state that is held in RocksDB on the current partition leader.

Zeebe works over a stream of workflow events. The events get committed to a replicated append-only log and the current state of workflows can always be reconstructed by projecting from that log. As workflow events are consumed into snapshots and streamed through configured exporters they can be removed from the log. Slow exporters and long snapshot periods cause the on-disk event log to grow as segments cannot be deleted until they are exported and snapshotted.
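The deletion rule described here can be sketched as follows. This is an assumption-laden toy, not the broker's implementation: `Segment` and `deletable_segments` are hypothetical names, and the only idea taken from the post is that a segment can go once it has been both snapshotted and exported.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    first_position: int
    last_position: int


def deletable_segments(segments, snapshot_position, exporter_positions):
    """Return the segments whose events are both snapshotted and exported."""
    # The slowest consumer (snapshot or any exporter) sets the bound.
    safe_position = min([snapshot_position, *exporter_positions])
    return [s for s in segments if s.last_position <= safe_position]


segments = [Segment(1, 100), Segment(101, 200), Segment(201, 300)]

# A fast exporter lets the snapshot position decide what can be removed.
fast = deletable_segments(segments, snapshot_position=250, exporter_positions=[260])

# A slow exporter holds every segment on disk, however recent the snapshot.
slow = deletable_segments(segments, snapshot_position=250, exporter_positions=[90])
```

With the fast exporter two segments can be removed; with the slow one none can, which is the on-disk log growth mentioned above.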

Lots of long-running processes can cause the snapshot and in-memory state size to grow because they have current state for a long time. But that is a function of the total number of current processes.
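One way to see why lifetime matters only through the number of concurrent instances is Little's law: in a steady state, the average number of instances in flight equals the start rate multiplied by the average lifetime. The sketch below applies it with made-up numbers; the rates and lifetimes are illustrative assumptions, not measurements.

```python
def concurrent_instances(starts_per_day: float, lifetime_days: float) -> float:
    """Little's law: average instances in flight = arrival rate * time in system."""
    return starts_per_day * lifetime_days


# Same start rate, different lifetimes (hypothetical figures).
short_lived = concurrent_instances(starts_per_day=10_000, lifetime_days=1)
long_lived = concurrent_instances(starts_per_day=10_000, lifetime_days=365)

# A low start rate keeps the count small even with year-long instances.
few_long_lived = concurrent_instances(starts_per_day=10, lifetime_days=365)
```

Under these assumptions, the year-long workload holds 365 times as many instances in state as the one-day workload at the same start rate, while a handful of long-lived instances per day stays small.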

I’m not sure if the in-memory representation of a long-running process holds more information than that of a shorter-lived one. Maybe @Zelldon or @deepthi can say?

GuSuku · January 27, 2020, 1:50pm

Thanks @jwulf. While we wait for others to chime in, I have a few clarification questions.

So snapshots are generated (every 15m) by compacting logs. But Bernd states in one of his blogs [1] [2] that “RocksDB also flushes the snapshot state to disk”. I am trying to understand how this is done and if this will alleviate memory pressure for long-running workflows.

Does it use RocksDB’s tiered storage to offload/flush from memory to disk? How modular is RocksDB integration? Can I use RocksDB-Cloud instead, for example, to enable caching to S3 as well?

jwulf · January 27, 2020, 2:19pm

The answers to all of these questions are in the second of Bernd’s blog articles that you linked:

Zeebe writes the log to disk and RocksDB also flushes its state to disk. Currently this is the only supported option. We regularly discuss making storage logic pluggable — for example support Cassandra — but so far we’ve focused on file system and it might even be the best choice for most use cases, as it is simply the fastest and most reliable option.

In the first sentence of that quote, “the log” is the event log and RocksDB’s “state” is the snapshot.

This is a good place to look: Issues · camunda/zeebe · GitHub

The Zeebe engineering team have good discipline around documenting their work in issues and PRs, and I frequently use the GitHub issues to understand aspects of Zeebe.

Update: Right before that, Bernd says:

We decided to do things differently. As soon as we have completely processed an event and applied it to the snapshot, we delete it right away. I’ll come back to “completely processed” later on. This allows us to keep the log clean and tidy at all times, without losing the benefits of an append-only log and stream processing — as described in a minute.

This suggests that all events in the process are in the snapshot. It is a question then of how the events are applied to RocksDB and how the stored data is structured. If it is mutable state, then long-running processes should not cause it to grow disproportionately. If it grows in proportion to the number of events in a process, then it will grow over time for long-running processes. Let me see if the RocksDB structure source code is easy to grok…
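The two possibilities can be contrasted with a small sketch. Neither function reflects how Zeebe actually stores data; they are hypothetical models of "mutable state" and "state that keeps every event", fed with an invented event stream for a single long-running instance.

```python
def mutable_state_size(events) -> int:
    """Model 1: one entry per instance, overwritten on every event."""
    state = {}
    for instance_id, element in events:
        state[instance_id] = element   # the latest position replaces the old one
    return len(state)


def per_event_state_size(events) -> int:
    """Model 2: every event is retained, so size follows the event count."""
    history = []
    for event in events:
        history.append(event)
    return len(history)


# One instance that passes through 1000 steps over its lifetime.
events = [("wf-1", f"step-{i}") for i in range(1000)]

mutable_size = mutable_state_size(events)
per_event_size = per_event_state_size(events)
```

In the first model the instance occupies one entry no matter how long it runs; in the second it occupies one entry per event, which is the growth pattern that would hurt long-running processes.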

Further update: I looked through the source code and can’t figure out how data is persisted to RocksDB, apart from that it uses a ZeebeDB abstraction. This looks like where rehydration takes place, which gives some idea of the shape of the data: zeebe/engine/src/main/java/io/zeebe/engine/state at 958f98cf7e7085b59e3d68bbaa4244193655ecbd · camunda/zeebe · GitHub

There is a child-parent relationship stored for variable scopes. If you have something like a repeating sub-process in a long-running process - depending on whether completed iterations are kept in the store, this could cause state explosion. If that is the case, the new Call Activity would alleviate that.

deepthi · January 27, 2020, 5:36pm

A growing state does not cause memory issues because RocksDB pushes in-memory state to disk. The limitation that we have now is mostly because we replicate snapshots to other replicas periodically. If the snapshot is large, it needs more network bandwidth and can affect normal processing time. How large a snapshot a Zeebe cluster can handle really depends on the infrastructure it is running on.

Zelldon · January 27, 2020, 8:01pm

In addition to @deepthi’s comment, I would like to say that we do not care whether process instances are long-running or short-running; there is no difference in the state. The state only gets bigger when you have more process instances in parallel, which is more likely with long-running process instances.

Hope this helps.

Greets
Chris

jwulf · January 27, 2020, 11:26pm

So RocksDB does something like paging to virtual memory when it hits an upper limit?

deepthi · January 28, 2020, 9:37am

@jwulf RocksDB internally stores data in an LSM tree. The keys are stored in different levels, and the lowest levels are usually kept in memory. Keys in the lower levels are pushed to higher levels during “compaction”. I can’t remember the details, but typically the most frequently or most recently accessed keys remain in the lower levels, enabling faster access.
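For readers unfamiliar with LSM trees, here is a deliberately simplified toy model of the mechanism. It is not RocksDB's API or its actual compaction policy: `ToyLSM`, its limits, and the two "levels" are assumptions made for illustration, and the on-disk levels are simulated with plain dictionaries.

```python
class ToyLSM:
    """Toy LSM tree: an in-memory memtable plus two simulated on-disk levels."""

    def __init__(self, memtable_limit: int = 2, level0_limit: int = 2):
        self.memtable = {}     # recent writes, held in memory
        self.level0 = []       # flushed sorted runs, oldest first
        self.level1 = {}       # one merged run produced by compaction
        self.memtable_limit = memtable_limit
        self.level0_limit = level0_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the memtable out as an immutable sorted run and start afresh.
        self.level0.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        if len(self.level0) >= self.level0_limit:
            self._compact()

    def _compact(self):
        # Merge level-0 runs into level 1, oldest first so newer values win.
        merged = dict(self.level1)
        for run in self.level0:
            merged.update(run)
        self.level1 = dict(sorted(merged.items()))
        self.level0 = []

    def get(self, key):
        # Newest data first: memtable, then level 0 (newest run first), then level 1.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.level0):
            if key in run:
                return run[key]
        return self.level1.get(key)


db = ToyLSM()
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5)]:
    db.put(key, value)
```

After these writes only `d` is still in the memtable; `a`, `b`, and `c` have been flushed and compacted into the higher level, and a read of `a` returns the newer value. This is why total state can exceed memory: only the most recent writes have to stay resident.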

GuSuku · January 29, 2020, 1:19pm

@Zelldon Thanks, that is indeed true. I will update my OP to make terminology accurate. (Unfortunately I cannot edit my OP any longer)

@jwulf Thanks for those resources! Will check them out.

@deepthi Thanks for clarifying that. Workflows with many long-running instances usually have sparse transitions, and the change delta will be very small. Can snapshot replication copy just this delta?