Process definitions not visible in Operate

rkrzewski · February 22, 2023, 2:13pm

Hello,

I’m trying to figure out why some process definitions are not showing up in Operate after being successfully deployed to the broker using zbctl.

I have two Zeebe installations on my own k8s cluster that do not exhibit this problem, and two installations on clients’ k8s clusters that do. After deploying 7 BPMN process definitions to one of the problematic installations, 2 processes showed up in Operate. After wiping broker disks and ES indices, restarting the broker and Operate, and redeploying the processes, one process showed up in Operate. In the second problematic installation, no processes at all showed up in Operate.

Repeated deployments of the process definitions do not change the situation. I am confident that the processes are deployed correctly, because I can start them through the client API and my JobWorkers are getting invoked. No errors are visible in either the zbctl output or the broker logs, even after raising ZEEBE_LOG_LEVEL to DEBUG.

I was trying to figure out where Operate reads process definitions from, and my hypothesis is that it’s the process index in ES.
In the installations that do not exhibit the problem, this index for the date of the initial deployment contains 7 documents, which equals the number of my BPMN processes. On the problematic installation where only one process is visible in the Operate UI, this index contains 1 document. I’ve also noticed that the deployment index contains a multiple of 7 documents, where the multiplier is the number of deployments performed on that day. This is true both for the installations that exhibit the problem and for those that do not.
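
For anyone who wants to repeat this check, here is a minimal sketch of how the document counts can be read using the Elasticsearch `_count` API. The base URL and the index patterns (`zeebe-record_process_*`, `zeebe-record_deployment_*`) are assumptions based on the exporter's default index prefix; adjust them to whatever your installation actually uses.

```python
import json
from urllib.request import urlopen


def count_url(base_url, index_pattern):
    """Build the URL of the Elasticsearch _count endpoint for an index pattern."""
    return f"{base_url.rstrip('/')}/{index_pattern}/_count"


def parse_count(body):
    """Extract the document count from a _count response body."""
    return json.loads(body)["count"]


def count_documents(base_url, index_pattern):
    """Ask Elasticsearch how many documents match the index pattern."""
    with urlopen(count_url(base_url, index_pattern)) as resp:
        return parse_count(resp.read())


if __name__ == "__main__":
    # Assumed values: replace with your ES address and index names.
    es = "http://localhost:9200"
    for pattern in ("zeebe-record_process_*", "zeebe-record_deployment_*"):
        print(pattern, count_documents(es, pattern))
```

On a healthy installation, the process count should match the number of deployed BPMN processes, and the deployment count should be a multiple of it.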

I spent some time reading the ElasticsearchExporter sources and I’m considering adding much more debug logging to it, but cooking up patched docker images and threading them through the deployment pipeline is a bit of work. Moreover, I fear that the Process record (which I expect to be generated on deployment of a new process) is not being dropped at the exporter level, but rather is not generated at all.

I’ll be grateful for any tips on solving this issue. I’m happy to provide information about configuration, logs etc.

Cheers,
Rafał

rkrzewski · February 23, 2023, 10:35am

I was able to recover one of the environments by wiping disks and reinstalling components one by one:

1. stopped all Zeebe components
2. wiped broker disks
3. deleted indices in ES
4. started Broker, waited for the cluster to be fully up
5. started Gateway
6. started Optimize, waited for schema migration to finish
7. started my own component that deploys BPMN processes on startup

After that, process definitions were visible in Operate as expected.
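
The "start one component, wait, then start the next" part of this procedure could be scripted roughly as follows. This is only a sketch: the readiness URL is an assumption (I have not verified the exact host, port, or path for every Zeebe version), and the step that actually starts each component is left out because it depends on your deployment tooling.

```python
import time
from urllib.error import URLError
from urllib.request import urlopen


def http_ready(url, timeout=2.0):
    """Return True if the URL answers with a 2xx status, False otherwise."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer such as 503.
        return False


def wait_until(check, timeout=300.0, interval=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() repeatedly until it returns True or the timeout expires."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Assumed readiness endpoint: replace with the one your broker exposes.
    broker_ready = "http://zeebe-broker:9600/ready"
    if not wait_until(lambda: http_ready(broker_ready)):
        raise SystemExit("broker did not become ready in time")
    print("broker is ready, safe to start the next component")
```

The same `wait_until` call can be repeated before each subsequent step, so that nothing starts deploying processes until the components it depends on report ready.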

My current hypothesis is that starting everything at once can lead to data loss / damage despite k8s liveness and readiness checks and automatic retries.

Cheers,
Rafał