Pretty new to Camunda products, looking for some direction, hope my question is not a complete duplicate:-/
Background:
We are running a greenfield microservices environment. All microservices are allocated to bounded contexts (BCs) and we paid attention to “loose/low coupling and high cohesion”, i.e. making BCs highly independent and autonomous. Every bounded context controls which APIs to expose publicly outside the BC via a dedicated API gateway.
All microservices expose REST APIs, offering both sync and async execution (in the async case, command messages are dropped onto Kafka and the sync REST response is just a receipt acknowledgement).
All microservices emit events to Kafka after successful processing. A bounded context contains both atomic capability services (= aggregate-centric state machines) and micro orchestrations (which are ideally encapsulated in the BC, i.e. no/limited outbound calls).
Microservices are idempotent and support distributed sagas by exposing cancellation/compensation interfaces.
Having said that, beyond the “vertical” BCs, we do have end-to-end business processes which cut across bounded contexts. In order to have visibility and to clearly separate the capabilities from the process (for flexibility reasons), we want to introduce cross-cutting orchestrators rather than relying on peer-to-peer choreography.
The target is for these orchestrators to “own” the overarching business process, but to merely submit work items/commands via the async REST interface of the individual services (i.e. no blocking). The orchestrator would then monitor Kafka to detect the events and to trigger the next activity. So in this case, we would need just two layers: the orchestrator and the individual services.
Zeebe looks great and I understand the scalability benefits through logs and event sourcing, but it seems that it requires an additional facade layer between the orchestrator/broker and the individual services, i.e. we would need to implement a job handler for every microservice which would poll the broker and forward “work items” to the service(s). So the trade-off seems to be complexity vs scalability. In this case, I would rather compromise on scalability for now, but I am probably missing other aspects.
So in a nutshell, what are the trade-offs of Camunda BPM vs Zeebe in an existing microservices environment where the services are owned/managed in a decentralised way by different squads?
Hey Nick! Thanks for your detailed message and welcome to the Zeebe community!
Actually there is one point I would love to comment on: I think the orchestration process must be owned by some BC too, as it is clearly business logic. This might not change a lot in the technical architecture, as ownership of the orchestration workflow (= in the BC) can be separated from the physical deployment (on some Zeebe broker). I talked about this a bit in NYC: https://www.youtube.com/watch?v=91IvqsT10QU. One central question will be how many workflow engines you want to operate (1 central? 1 per squad? 1 per microservice?). But as this does not really seem to be your question, I don’t want to stress it too much now.
To answer your question: You should definitely go for Zeebe if you need cloud scale, run in a cloud-native environment (where e.g. an RDBMS is a problem) or possibly even want to use Zeebe itself as a managed cloud service (we are currently building that offering). Besides that, Zeebe supports polyglot environments better (C#, NodeJs, …) and has a couple of improved features around microservice orchestration (e.g. message buffering). We are also working on an out-of-the-box Kafka Connect connector.
If these are not good arguments, I would recommend going with Camunda BPM, as the feature set is simply bigger and more mature. If you leverage the so-called External Tasks, your code will be pretty much the same as it would be with Zeebe. Architecturally you can build similar architectures with both.
You could have that JobWorker as part of your microservice itself, so it fetches the work directly from Zeebe (or Camunda). As we work with logical topic names only here, the microservice does not need any knowledge about the workflow itself.
Bonus: The handlers are typically relatively easy to write - don’t be afraid of them.
First of all, I completely agree with the comment regarding ownership of the business process. There is very likely a difference between owning the individual services versus the end-to-end flow. Basically, we need to support multiple concerns, both from a vertical perspective as well as horizontally.
Secondly, besides the visibility aspect, most people seem to ignore the inherent gotchas in choreography. Yes, you can add subscribers without changing any other service, but this would be more like kicking off a completely new business process. As soon as you change the existing process by inserting an activity, you need to make more than one change (as you talked about). But it gets even worse. The event itself (= event name) is just one piece of data about what happened that can be used for decisions. All the data within the event can also be used for intelligent “routing”. What if the sequence of business process steps depends on the customer tier? Now the PaymentService needs to subscribe to multiple events (e.g. OrderCaptured vs OrderShipped) in order to ensure that standard customers get charged BEFORE the shipment goes out, while premium customers only get charged AFTER the shipment. Especially in a world where we want to personalise the user experience, I am worried that a choreography approach would lead to spaghetti code and actually more (rather than less) coupling.
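To make the customer-tier example concrete, here is a minimal sketch of what that routing decision could look like if it lived in one orchestrator instead of being spread across subscribers. The `ROUTES` table, the command names and the `PaymentCharged` event are assumptions made up for illustration; only OrderCaptured, OrderShipped and the two tiers come from the example above.

```python
from typing import Dict, List, Tuple

# Hypothetical routing table: (customer tier, event name) -> commands to issue.
# Standard customers are charged BEFORE shipment, premium customers AFTER.
ROUTES: Dict[Tuple[str, str], List[str]] = {
    ("standard", "OrderCaptured"): ["ChargePayment"],
    ("standard", "PaymentCharged"): ["ShipOrder"],
    ("premium", "OrderCaptured"): ["ShipOrder"],
    ("premium", "OrderShipped"): ["ChargePayment"],
}


def next_commands(event_name: str, customer_tier: str) -> List[str]:
    """Turn one event into zero or more commands, based on the event data."""
    # Unknown combinations yield no commands rather than raising.
    return list(ROUTES.get((customer_tier, event_name), []))
```

With this shape, the PaymentService only needs to accept a `ChargePayment` command; the knowledge of when to charge which tier sits in one place.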
Having said all of that, we are not going to change everything instantly. Getting visibility into choreography is merely the first step of taking control. Orchestration will be the second one.
My question was about step 1 :)
To provide more technical background:
All our services are running on top of a service mesh which provides significant benefits in terms of overall tracing and visibility in terms of “who requested what?”. It also makes it easier to secure the command interface using standard REST mechanisms. Supporting Kafka as an alternative command interface not only becomes hard to manage/secure but also hard to visualise with Istio.
Essentially, our backend service architecture supports both sync (for in-band blocking calls) and async processing (for out-of-band non-blocking calls), but all commands are PUSH and exposed as REST.
That means that there is no classic choreography in our environment like “Service 2 listens to event A from Service 1, analyses the event and decides to execute a certain action”. Instead, there is ideally always an orchestrator sitting between Service 1 (issuing the event) and Service 2 (processing a command). That orchestrator in the middle is responsible for the decision logic of how to react to an event. It’s not the responsibility of the individual service to understand the end-to-end flow.
Orchestrator: Turning an event into a command (or multiple commands) according to a business process definition
Service: Turning a command into an event
Services and orchestrators are complementary and allow us to model the entire feedback loop of “Event → Command → Event → Command…”.
The only exception is if a service uses an event for data replication, i.e. to update its internal materialised view with new data. But that would be side-effect-free, as the service wouldn’t be allowed to make any outbound calls. So it is basically not part of the service landscape.
In our current model, it is essential that the orchestrator is capable of “conducting”, i.e. explicitly telling services to do something. Implementing the JobWorker within the microservices basically implies an indirection where the REST endpoints of the services are mapped onto topic names that Zeebe writes to and the JobWorker reads from.
That means that services would now support three interface technologies: REST, Kafka and Zeebe.
Irrespective of the orchestration aspect, feeding events from Kafka back into Zeebe would be interesting.
Understand that the connector is not production ready, but are there any inherent limitations to be aware of? Our events are on different topics, typically by eventType, unless ordering requires consolidation.
From an internal perspective, does that mean that by not implementing the JobWorkers and only feeding the events back to Zeebe, we are effectively just activating the “awareness” part of it? Still struggling to fully understand the interplay.
Or is Zeebe purely a facade to feed the events to Elastic, from where Operate can correlate/visualise them?
Zeebe will not actively call REST services. I see 2 options:
You build some service-specific piece of code (the worker), which does the REST call (and the data input/output mapping). This needs to run somewhere, so you might get an additional component to be deployed, but logically it is part of the service. So there is no additional outside interface.
You build a generic REST “connector” as a worker. You could configure that by using task headers (https://docs.zeebe.io/bpmn-workflows/service-tasks.html#task-headers). Then you could run that worker alongside Zeebe and invoke REST calls without any additional effort. But you have to think about what the connector will be able to do - and how to configure it.
Of course the target service could also subscribe to Zeebe directly - but I am not listing that as an option here, as then you really get an additional way of communication in your architecture.
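As a rough illustration of the second option, the core of such a generic REST worker could look like the sketch below. It deliberately leaves out the Zeebe client and only shows the mapping from a job's task headers and variables to an HTTP request. The header keys `url` and `method` are conventions assumed for this sketch, not something Zeebe prescribes, and `handle_job` is a hypothetical handler name.

```python
import json
import urllib.request
from typing import Any, Callable, Dict


def build_request(task_headers: Dict[str, str], variables: Dict[str, Any]) -> urllib.request.Request:
    """Map a job's static task headers and its variables onto an HTTP request."""
    url = task_headers["url"]  # assumed header key: target endpoint
    method = task_headers.get("method", "POST").upper()  # assumed header key
    body = json.dumps(variables).encode("utf-8")  # job variables become the JSON payload
    return urllib.request.Request(
        url,
        data=body,
        method=method,
        headers={"Content-Type": "application/json"},
    )


def handle_job(
    task_headers: Dict[str, str],
    variables: Dict[str, Any],
    send: Callable[[urllib.request.Request], Any] = urllib.request.urlopen,
) -> int:
    """Send the command and return the HTTP status code of the acknowledgement."""
    with send(build_request(task_headers, variables)) as response:
        return response.status
```

The `send` parameter is only there so the handler can be exercised without a network. In a real worker, this function would be called from whatever job handler the chosen Zeebe client library provides, and the job would be completed or failed depending on the status.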
I am actually not completely sure about the unknown unknowns of the prototypical Kafka Connect connector; my gut feeling is that it should work. But we will start working on a supported Kafka Connect connector soon and will investigate it much more. Feel free to experiment with the existing prototype in the meantime!
From an internal perspective, does that mean that by not implementing the JobWorkers and only feeding the events back to Zeebe, we are effectively just activating the “awareness” part of it? Still struggling to fully understand the interplay.
Exactly, no Job Workers, just feeding in Events, but Zeebe is executing real workflow instances. By the way - you could also do that without Kafka and implement it via API directly (probably behind REST in your case).
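To illustrate the “observer only” idea, here is a tiny in-memory sketch of what tracking progress purely from incoming events amounts to. This is not Zeebe's API: in a real setup the events would be published as messages and correlated to workflow instances by a correlation key. The class name, the step list and the key format are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class ObservedProcesses:
    """In-memory stand-in for an engine that only observes events."""

    def __init__(self, expected_steps: List[str]) -> None:
        self.expected_steps = expected_steps
        self.seen: Dict[str, List[str]] = defaultdict(list)

    def on_event(self, correlation_key: str, event_name: str) -> None:
        """Record an event for the instance identified by the correlation key."""
        self.seen[correlation_key].append(event_name)

    def waiting_for(self, correlation_key: str) -> Optional[str]:
        """Return the first expected step not yet observed, or None if all were seen."""
        observed = self.seen.get(correlation_key, [])
        for step in self.expected_steps:
            if step not in observed:
                return step
        return None
```

No commands are ever issued here; the model only answers “where is this instance right now?”, which is the visibility part of the picture.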
Did I mention that we are working on a solution where you could feed events directly into https://camunda.com/products/optimize/? That could also allow you to gain some visibility into the choreographies you have (without Zeebe being involved at all in that case). We are searching for pilot customers to validate some assumptions here at the moment - hint hint
Sorry for the rather long silence, lots of things going on.
But we have now dipped our toes deeper into using Zeebe purely from an observation aspect, i.e. our starting point/baseline is a choreography solution. Conversion into orchestration is not feasible at the moment, but we will strongly consider it for new processes if there is a good fit/experience.
So the key concerns at this point in time are to conveniently monitor progress, while understanding how long things take and whether/where the process gets stuck, ideally with a dashboard view and feeding into some kind of alerting solution.
A couple of questions came up in our team on which we could use some input:
You mention here ( https://github.com/zeebe-io/kafka-connect-zeebe#configuration ) that currently the connector does not support schemas. Is there a plan to support Avro schemas? What will we need to do in order to support them? FYI, we are in the process of moving all our Kafka topics towards Avro.
As we would like to first use Zeebe as an observer without pushing anything to Kafka, is there a way to finish a workflow with an error in such a way that it shows up nicely in the Camunda Operate dashboard? For instance, if some event doesn’t arrive within a minute, create an incident? So far, the best we could come up with was to filter instances which ended up in some specific state, but it doesn’t really shout out. Also, the node filter doesn’t seem to persist after a refresh even though it is kept in the URL (operate:1.1.0).
We found that Camunda Operate is for non-production use only, and an enterprise version is not yet available. When is it expected to be production ready? Or, what could we use as an alternative for production?
In terms of integration with alerts, do you have any recommendation where/how this could get plugged into the solution?
As we would like to first use Zeebe as an observer without pushing anything to Kafka, is there a way to finish a workflow with an error in such a way that it shows up nicely in the Camunda Operate dashboard? For instance, if some event doesn’t arrive within a minute, create an incident? So far, the best we could come up with was to filter instances which ended up in some specific state, but it doesn’t really shout out. Also, the node filter doesn’t seem to persist after a refresh even though it is kept in the URL (operate:1.1.0).
Use a subprocess with an interrupting timer boundary event that routes to a worker which raises an incident.
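The logic behind that pattern can be sketched outside of BPMN as follows. In the actual workflow the timer would be modelled on the subprocess boundary and the incident raised by a worker; the `TimeoutWatcher` class, its method names and the message text below are hypothetical and only show the deadline bookkeeping.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Expectation:
    event_name: str
    deadline: float  # point in time (seconds, caller's clock) after which we give up


class TimeoutWatcher:
    """Tracks one awaited event per instance and reports the ones that timed out."""

    def __init__(self, timeout_seconds: float = 60.0) -> None:
        self.timeout_seconds = timeout_seconds
        self.pending: Dict[str, Expectation] = {}

    def expect(self, instance_key: str, event_name: str, now: float) -> None:
        """Start waiting for an event; comparable to entering the subprocess."""
        self.pending[instance_key] = Expectation(event_name, now + self.timeout_seconds)

    def arrived(self, instance_key: str, event_name: str) -> None:
        """The awaited event showed up in time, so the timer is cancelled."""
        expectation = self.pending.get(instance_key)
        if expectation is not None and expectation.event_name == event_name:
            del self.pending[instance_key]

    def overdue(self, now: float) -> List[str]:
        """Return one incident message per timed-out instance and stop waiting for it."""
        expired = [key for key, e in self.pending.items() if now >= e.deadline]
        incidents = []
        for key in expired:
            e = self.pending.pop(key)  # interrupting: the wait ends once the timer fires
            incidents.append(f"{key}: {e.event_name} not received within {self.timeout_seconds:g}s")
        return incidents
```

The list returned by `overdue` is also a natural place to hook in an alerting integration, since each entry corresponds to one stuck instance.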
Hey Fellas,
This is an interesting conversation that you guys had above.
Kudos for the insights. Now I am wondering, @monohusche, were you able to get the “monitoring system” you wanted?
So the key concerns at this point in time are to conveniently monitor progress, while understanding how long things take and whether/where the process gets stuck, ideally with a dashboard view and feeding into some kind of alerting solution.
I am also looking for something similar.
I have explored Zeebe with Kafka on my AWS MSK setup and it is working so far, but I want to make it more intuitive, i.e. set up alerts whenever things fail, or recover a workflow instance manually if something breaks, e.g. when a microservice is down or a Kafka broker is not able to process messages.