Approach to batch processing

tomconnolly · February 22, 2018, 10:55pm

Hi

I’m currently looking for an approach to processing a batch of ‘items’, similar to the classic batch order processing pattern shown.

I’m wondering whether to implement it myself (write line items to the DB, retrieve a batch, mark it as retrieved, etc.), use Spring Batch exposed via HTTP APIs, or use the Camunda batch extension.

I don’t need to use Cockpit but am torn as to whether to invest in the batch extension.

Regards
Tom.

Markus · February 23, 2018, 7:54am

Hi @tomconnolly,

The Camunda BPM custom batch extension would be a good solution for your use case.
I have never used it myself, but I have read a bit about it, and I implemented a similar solution at my company that works perfectly for such a use case.

Because it uses Camunda batch in the background, you do not have to worry about chunks, retries, or exception handling.

I think it would be worthwhile to give it a chance.

Best regards,

Markus

tomconnolly · February 27, 2018, 1:00am

Thanks Markus, the plugin looks good, but I’ve gone back to the author to work out an approach to separating it into an add / submit; please see Sponsor wanted: Camunda Custom Batch Extension

StephenOTT · February 27, 2018, 3:25am

@tomconnolly what’s your use case/reasoning for using the batch extension rather than the BPMN pattern in your first post?

tomconnolly · February 27, 2018, 10:59am

Hi Stephen

My preference is to use out of the box functionality or, in this case, an extension to the Camunda Batch functionality. There is the added benefit of cockpit support and the management of the batch data in the Camunda DB. Happy for guidance here however as this is the goal of this question.

An alternative would be to implement Java delegates with, say, Spring Boot and a separate Orders DB, which follows a similar pattern to the Camunda Batch approach.

‘Post Orders’ will return a response. In the example I have not represented re-submission or an approach to retries, timeouts or interventions. Needless to say, it must survive a Camunda instance crash; the system should not lose any data.

Ideally I was hoping to use the BPMN definition without coding. Previously I’d seen success using Camunda to orchestrate services using the OOTB connectors.

As to my NFRs, there are potentially 100k+ instances requiring processing, so I can spread collection and batch processing across many instances. I’m more concerned about DB scalability; thus far I’ve only tested with 20k+ process instances in a PoC, using Postgres running on my laptop.

Appreciate your time and guidance here.

Regards
Tom.

StephenOTT · February 27, 2018, 5:32pm

@tomconnolly, but what is stopping you from creating a (nearly) exact copy of the BPMN pattern you showed in the first post? → Each order received by the Message start event waits for a message of completion, and daily you execute your batch process.

something like this:

There would be a few different ways to build the setup depending on your needs.
But essentially it’s: each order is a process instance. You query for those instances and then send a message to each order instance to mark it as part of a batch (you could remove this step depending on your retry and data-loss scenarios). Once the batch has executed, you message each instance that was part of the batch.
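
To make the lifecycle above concrete, here is a minimal plain-Java sketch. It deliberately does not use the Camunda API: the class, the state names and the map standing in for persisted process instances are all assumptions made up for illustration. It only models the three steps described (receive an order, mark waiting orders as part of a batch, message them once the batch has run).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the pattern; no Camunda API is used here.
class OrderBatchSketch {
    // Hypothetical states standing in for the wait points of an order instance.
    enum State { WAITING_FOR_BATCH, IN_BATCH, COMPLETED }

    // Order id -> current state; a stand-in for persisted process instances.
    final Map<String, State> orders = new LinkedHashMap<>();

    // "Message start event": each received order becomes a waiting instance.
    void receiveOrder(String orderId) {
        orders.put(orderId, State.WAITING_FOR_BATCH);
    }

    // Query the waiting orders and mark each one as part of the batch.
    List<String> collectBatch() {
        List<String> batch = new ArrayList<>();
        for (Map.Entry<String, State> entry : orders.entrySet()) {
            if (entry.getValue() == State.WAITING_FOR_BATCH) {
                entry.setValue(State.IN_BATCH);
                batch.add(entry.getKey());
            }
        }
        return batch;
    }

    // After the batch has run, notify every order that was part of it.
    void completeBatch(List<String> batch) {
        for (String orderId : batch) {
            orders.put(orderId, State.COMPLETED);
        }
    }
}
```

Because an order that arrives after `collectBatch()` stays in `WAITING_FOR_BATCH`, it is simply picked up by the next run; in the real process, the marking step is what the engine would persist.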

This might also be a good fit for the external task worker pattern.

tomconnolly · February 28, 2018, 12:06am

Hi Stephen

That’s it. I finally get it.

This approach survives a Camunda restart: the list can be recreated in Get Batch, and by Mark Process for Batch the list has been serialised to the DB for subsequent execution.

Again appreciate you taking the time to guide me.

Regards
Tom.

patrick.wunderlich · February 28, 2018, 7:09am

Hi Stephen and Tom,

Nice approach. The second ‘send complete’ message is maybe not needed, since Camunda batch is async.
Otherwise you have to poll in a second delegate until the batch is really finished. But I think there is no need to wait once the batch is created, since Camunda will then be responsible for finishing them, and they are fail-safe in any case.

It is probably also important that the wait steps are really the last steps in the order process (or maybe in a tiny process of their own), with no other async step after them. Otherwise you will have 100000 processes at once that also need to be processed by the job executor (in addition to the X batch jobs).

Regards,
Patrick

tomconnolly · March 9, 2018, 4:01am

Hi Stephen and Patrick

Put together a PoC to verify the solution.

Note I kept the scheduler and batch start separate just for testing.

Appreciate feedback.

Regards
Tom.

patrick.wunderlich · March 12, 2018, 4:05pm

Hi Tom,

looks nice!

But now you will create a batch for all active orders, no matter if they are waiting in the “Receive Batch ID” message event or not. And why do you need just the latest process version? Wouldn’t it be better to use the process key and the activityId of the message event in your process query?

And why do you want a max batch size?

Regards,
Patrick

tomconnolly · March 13, 2018, 9:36am

Hi Patrick

The batch process will take, in the example, 1000 running instances (configurable), add these to the batch, and send a correlation back to ‘Receive Batch Id’ using the process id. So if there were 5,000 waiting instances, after the correlation there would be 4,000. The instances are not stored / persisted in the batch.
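
The arithmetic can be sketched in plain Java as below. This is not the PoC code: `takeBatch` and the in-memory list of waiting ids are hypothetical stand-ins for the delegate and the engine query.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of a configurable maximum batch size.
class BatchSizeSketch {
    // Removes up to maxBatchSize ids from the (mutable) waiting list
    // and returns them as one batch.
    static List<String> takeBatch(List<String> waiting, int maxBatchSize) {
        int size = Math.min(maxBatchSize, waiting.size());
        List<String> head = waiting.subList(0, size);
        List<String> batch = new ArrayList<>(head);
        head.clear(); // clearing the subList view removes those ids from 'waiting'
        return batch;
    }
}
```

With 5,000 waiting ids and a maximum of 1,000, one call returns 1,000 ids and leaves 4,000 waiting for later runs.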

And why do you need just the latest process version?
Wouldn’t it be better to use the process key and the activityId of the message event in your process query?

Not sure I understand. Is this the GetBatchDelegate using the business key?
Is it possible to explain with an example please?

Regards
Tom.

patrick.wunderlich · March 13, 2018, 1:40pm

Hi Tom,

let’s say you have the following running process instances for order.bpmn:

1. OrderProcess v1, waiting at “Receive Batch ID”
2. OrderProcess v1, waiting at some task before “Receive Batch ID”
3. OrderProcess v2, waiting at “Receive Batch ID”
4. OrderProcess v2, waiting at some task before “Receive Batch ID”

Currently, when executing GetBatchDelegate, you will create a batch for 3 and 4, because you just use the latest process version and you don’t filter on the activity.

When using the activityId as a filter criterion, you will get just 1 and 3, which would be more correct, wouldn’t it?
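
As a rough illustration of the difference, here is a plain-Java sketch of the two selection strategies applied to the four instances above. It does not call the Camunda query API; the `Instance` class and the activity ids `ReceiveBatchId` and `SomeEarlierTask` are made up for the example, and all four instances are assumed to share the same process key.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the two query strategies; no Camunda API is used here.
class BatchFilterSketch {
    // Hypothetical snapshot of a running instance: its position in the list
    // above, its process definition version, and the activity it waits at.
    static final class Instance {
        final int number;
        final int version;
        final String activityId;

        Instance(int number, int version, String activityId) {
            this.number = number;
            this.version = version;
            this.activityId = activityId;
        }
    }

    // The four instances from the example; the activity ids are made up.
    static List<Instance> example() {
        List<Instance> all = new ArrayList<>();
        all.add(new Instance(1, 1, "ReceiveBatchId"));
        all.add(new Instance(2, 1, "SomeEarlierTask"));
        all.add(new Instance(3, 2, "ReceiveBatchId"));
        all.add(new Instance(4, 2, "SomeEarlierTask"));
        return all;
    }

    // Behaviour as described: latest version only, no activity filter.
    static List<Integer> selectByLatestVersion(List<Instance> all) {
        int latest = 0;
        for (Instance instance : all) {
            latest = Math.max(latest, instance.version);
        }
        List<Integer> selected = new ArrayList<>();
        for (Instance instance : all) {
            if (instance.version == latest) {
                selected.add(instance.number);
            }
        }
        return selected;
    }

    // Suggested behaviour: any version, but only instances waiting at the given activity.
    static List<Integer> selectByActivity(List<Instance> all, String activityId) {
        List<Integer> selected = new ArrayList<>();
        for (Instance instance : all) {
            if (instance.activityId.equals(activityId)) {
                selected.add(instance.number);
            }
        }
        return selected;
    }
}
```

Here `selectByLatestVersion(example())` picks instances 3 and 4, while `selectByActivity(example(), "ReceiveBatchId")` picks 1 and 3, which matches the case described.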

Just saw that you have only one process, which contains all the subprocesses? I think it would be better to move the batch stuff into its own process.

Regards,
Patrick