Failed to send activated jobs to client (only for one job type)

Hello! I have a test environment.

Java app with
"io.camunda:spring-zeebe-starter:8.1.12"
"io.camunda:zeebe-client-java:8.1.5"

Docker containers
camunda/zeebe:8.1.5
elasticsearch:7.17.0

And it works very well! But after some time, only one job type fails to start its handler.

Java client log:
io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: Time out between gateway and broker: Request ProtocolRequest{id=32953, subject=command-api-1, sender=0.0.0.0:26502, payload=byte[]{length=1157, hash=-1568037875}} to 0.0.0.0:26501 timed out in PT10S
	at io.grpc.Status.asRuntimeException(Status.java:535) ~[grpc-api-1.49.1.jar!/:1.49.1]
	at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487) ~[grpc-stub-1.49.1.jar!/:1.49.1]
	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:563) ~[grpc-core-1.49.1.jar!/:1.49.1]
	at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70) ~[grpc-core-1.49.1.jar!/:1.49.1]
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:744) ~[grpc-core-1.49.1.jar!/:1.49.1]
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723) ~[grpc-core-1.49.1.jar!/:1.49.1]
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.49.1.jar!/:1.49.1]
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) ~[grpc-core-1.49.1.jar!/:1.49.1]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[na:na]
	at java.base/java.lang.Thread.run(Thread.java:833) ~[na:na]
camunda/zeebe output log:
2022-12-21 12:26:17.028 [ActivateJobsHandler] [Broker-0-zb-actors-1] DEBUG
      io.camunda.zeebe.gateway - Failed to send 2 activated jobs for type admin-module-get-custom-message-task-z (with job keys: [2251799930062329, 2251799930062332]) to client, because: Failed to send activated jobs to client
2022-12-21 12:26:22.160 [] [grpc-default-worker-ELG-5-1] WARN 
      io.grpc.netty.NettyServerHandler - Stream Error
io.netty.handler.codec.http2.Http2Exception$StreamException: Stream closed before write could take place
	at io.netty.handler.codec.http2.Http2Exception.streamError(Http2Exception.java:173) ~[netty-codec-http2-4.1.82.Final.jar:4.1.82.Final]
	at io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController$FlowState.cancel(DefaultHttp2RemoteFlowController.java:481) ~[netty-codec-http2-4.1.82.Final.jar:4.1.82.Final]
	at io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController$1.onStreamClosed(DefaultHttp2RemoteFlowController.java:105) ~[netty-codec-http2-4.1.82.Final.jar:4.1.82.Final]
	at io.netty.handler.codec.http2.DefaultHttp2Connection.notifyClosed(DefaultHttp2Connection.java:357) ~[netty-codec-http2-4.1.82.Final.jar:4.1.82.Final]
	at io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams.removeFromActiveStreams(DefaultHttp2Connection.java:1007) ~[netty-codec-http2-4.1.82.Final.jar:4.1.82.Final]
	at io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams.deactivate(DefaultHttp2Connection.java:963) ~[netty-codec-http2-4.1.82.Final.jar:4.1.82.Final]
	at io.netty.handler.codec.http2.DefaultHttp2Connection$DefaultStream.close(DefaultHttp2Connection.java:515) ~[netty-codec-http2-4.1.82.Final.jar:4.1.82.Final]
	at io.netty.handler.codec.http2.DefaultHttp2Connection$DefaultStream.close(DefaultHttp2Connection.java:521) ~[netty-codec-http2-4.1.82.Final.jar:4.1.82.Final]
	at io.netty.handler.codec.http2.Http2ConnectionHandler.closeStream(Http2ConnectionHandler.java:628) ~[netty-codec-http2-4.1.82.Final.jar:4.1.82.Final]
	at io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.onRstStreamRead(DefaultHttp2ConnectionDecoder.java:444) ~[netty-codec-http2-4.1.82.Final.jar:4.1.82.Final]
	at io.netty.handler.codec.http2.Http2InboundFrameLogger$1.onRstStreamRead(Http2InboundFrameLogger.java:80) ~[netty-codec-http2-4.1.82.Final.jar:4.1.82.Final]
	at io.netty.handler.codec.http2.DefaultHttp2FrameReader.readRstStreamFrame(DefaultHttp2FrameReader.java:509) ~[netty-codec-http2-4.1.82.Final.jar:4.1.82.Final]
	at io.netty.handler.codec.http2.DefaultHttp2FrameReader.processPayloadState(DefaultHttp2FrameReader.java:259) ~[netty-codec-http2-4.1.82.Final.jar:4.1.82.Final]
	at io.netty.handler.codec.http2.DefaultHttp2FrameReader.readFrame(DefaultHttp2FrameReader.java:159) ~[netty-codec-http2-4.1.82.Final.jar:4.1.82.Final]
	at io.netty.handler.codec.http2.Http2InboundFrameLogger.readFrame(Http2InboundFrameLogger.java:41) ~[netty-codec-http2-4.1.82.Final.jar:4.1.82.Final]
	at io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder.decodeFrame(DefaultHttp2ConnectionDecoder.java:173) ~[netty-codec-http2-4.1.82.Final.jar:4.1.82.Final]
	at io.netty.handler.codec.http2.Http2ConnectionHandler$FrameDecoder.decode(Http2ConnectionHandler.java:393) ~[netty-codec-http2-4.1.82.Final.jar:4.1.82.Final]
	at io.netty.handler.codec.http2.Http2ConnectionHandler.decode(Http2ConnectionHandler.java:453) ~[netty-codec-http2-4.1.82.Final.jar:4.1.82.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:519) ~[netty-codec-4.1.82.Final.jar:4.1.82.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:458) ~[netty-codec-4.1.82.Final.jar:4.1.82.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:280) ~[netty-codec-4.1.82.Final.jar:4.1.82.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[netty-transport-4.1.82.Final.jar:4.1.82.Final]

I also started getting periodic background log messages.

Java client background log:
2022-12-21 12:23:28.964  INFO 7 --- [-worker-ELG-1-2] io.grpc.internal.AbstractClientStream    : Received data on closed stream
2022-12-21 12:23:28.964  INFO 7 --- [-worker-ELG-1-2] io.grpc.internal.AbstractClientStream    : Received data on closed stream
(the same line repeated many more times)

This failure affects only one job type.
Other jobs work correctly.
This job also works correctly when camunda/zeebe and elasticsearch are started without old data.

How can I get more info about why the job can’t start?

Thank you!

Any thoughts on how to get more information would be very helpful.

I renamed the job type and it helped.

But this case worries me a lot.
The broker just doesn’t run the handler for the specified job type, and I can’t get any details. It looks as if the broker has a blacklist.

Does anyone have any ideas on this issue?

Hi @neki :wave: Sorry that you had to wait so long for a reply.

This indicates that data could not be written between the client and the gateway because the gRPC stream is closed. Typically, this happens when the client has restarted.

This is another indication of the same, but it provides different information. Now we see that two jobs were activated, but the gateway could not send them to the client.

I expect your client to close the connection to the gateway for some reason. You should find out why the client closes the connection. As I mentioned, the most common reason is that users attempt to run the workers as short-lived but do not configure them this way.

Note that workers use long polling to reduce request frequency, which keeps the connection open if there are no jobs available at the time, until new jobs become available. If you want to run your workers as short-lived, you should disable long polling. However, I recommend running the workers indefinitely instead.
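For reference, here is a minimal sketch of a long-running worker built with the plain Java client. The gateway address and timeout values are placeholders (the job type is the one from the logs above); treat it as an illustration of keeping the worker and its connection open, not as the poster's actual code:

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.worker.JobWorker;
import java.time.Duration;
import java.util.concurrent.CountDownLatch;

public class LongRunningWorker {
  public static void main(String[] args) throws InterruptedException {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500") // placeholder gateway address
        .usePlaintext()
        .build()) {

      // The worker keeps an ActivateJobs request open (long polling) until jobs arrive.
      try (JobWorker worker = client.newWorker()
          .jobType("admin-module-get-custom-message-task-z")
          .handler((jobClient, job) ->
              jobClient.newCompleteCommand(job.getKey()).send().join())
          .timeout(Duration.ofSeconds(30))        // job timeout: broker reactivates the job after this
          .requestTimeout(Duration.ofSeconds(60)) // how long a single poll request may stay open
          .open()) {

        // Keep the process (and thus the gRPC connection) alive instead of exiting.
        new CountDownLatch(1).await();
      }
    }
  }
}
```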

Thank you! This could be very helpful! I’ll try to explore this direction.

Although I’m still confused by what feels like hidden magic.
I want to be able to get more info on why the client closed the connection, but the client doesn’t give any.

Note:
I restarted the client app (the client is stateless).
But the problem remains after the restart: the same job type still fails.
Thus, I conclude that the broker’s data is the reason the connection gets closed.

I restarted the client app (the client is stateless).
But the problem remains after the restart:

A restart of your client would close the connection.

This is clear.
But after the restart the client creates a new connection.
And all the workers, except that one, work and handle new jobs.

It seems we have found something that will help solve such problems in the future.

@korthout - We are experiencing exactly the same errors on the Zeebe client (which is the benchmarking tool), and on the Zeebe gateway we get the stream error warning:

WARN
io.grpc.netty.NettyServerHandler - Stream Error
io.netty.handler.codec.http2.Http2Exception$StreamException: Stream closed before write could take place

In our case, since we are running the benchmarking tool, I don’t think it (the zeebe-client) closes the connection to the gateway. Let me know if there is a way to confirm that the connection is indeed closed.

We see this error only after the benchmark tool has been running for a certain duration, depending on the cluster configuration. Currently, with 7 brokers and 2 gateways, we saw this error when the last reported backpressure was 0.78% and the total PI/s started was 255. Not sure what suddenly happens to cause these errors.

Any further hints would be helpful. Thanks.

Hi @jgeek1

I’m unfamiliar with the provided grpc netty warning and I cannot reproduce it.

However, it seems to indicate what I already said before. The connection is closed for some reason. Typically, this would be due to client-side issues, e.g., request timeouts or stopping the client.

I discussed this with the team, and we know one common user scenario where the request may time out: the client’s request deadline can elapse while the gateway is still long polling for jobs, so the client closes the stream before the gateway can deliver the activated jobs.

So, on Zeebe’s side there is no ERROR or WARN logged, as this case is expected and handled transparently by Zeebe.

However, other cases may exist. Networks are simply not reliable. Typically, when an error occurs such that the gateway cannot provide the job to the client, it will attempt to reactivate the job. If that also fails, the broker will eventually time out the job, effectively achieving the same result, albeit with a delay.
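To make that timeout fallback concrete, here is a rough sketch (not anyone's actual code from this thread) of activating jobs directly with the Java client; the timeout passed here is what determines when a job that was activated but never completed becomes activatable again:

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.ActivateJobsResponse;
import java.time.Duration;

public class ManualActivationSketch {
  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500") // placeholder address
        .usePlaintext()
        .build()) {

      // If the activated jobs never reach the client (e.g. the stream is closed),
      // the broker makes them activatable again once this timeout elapses.
      ActivateJobsResponse response = client.newActivateJobsCommand()
          .jobType("admin-module-get-custom-message-task-z")
          .maxJobsToActivate(10)
          .timeout(Duration.ofSeconds(30))
          .send()
          .join();

      response.getJobs().forEach(job ->
          client.newCompleteCommand(job.getKey()).send().join());
    }
  }
}
```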

I’m not sure what else we can do here without having a reproducible example. So I hope this helps.

Thanks @korthout for the response. Your explanation of the client-side timeout theoretically matches what we are seeing when running the benchmarking tool, with a few differences.

We aren’t seeing any debug logs, but we constantly see warnings reported in the zeebe-gateway.

So, on Zeebe’s side there is no ERROR or WARN logged, as this case is expected and handled transparently by Zeebe.

This isn’t true. There are constant warnings in the zeebe-gateway logs stating the stream error shown above.

However, other cases may exist. Networks are simply not reliable.

  1. Are you sure the “Failed to send activated jobs to client” error and the “Stream closed” error are the same?
  2. This doesn’t seem to be a network error, because it always reproduces when the number of processes started on the cluster crosses the limit it can handle. With 7 brokers, 7 partitions, and 2 gateways we saw the started PI/s reach 223, and with 7 brokers, 11 partitions, and 2 gateways we saw it reach 268.

I’m not sure what else we can do here without having a reproducible example.

I think if you just run the benchmark tool against a specific cluster configuration, the issue will always reproduce. Kindly let me know if I can help further with setting up the cluster and running the tool.

Thanks.

Based on our discussion above, I tried a couple of things, but with no luck. The warnings still appear in the gateway:

  1. Increased the zeebe.client.job.timeout property for the benchmarking tool to 3600000.
  2. Increased the requestTimeout when creating the process instance to 5400 seconds.
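For context, here is roughly how those two settings map onto the plain Java client API. This is only a sketch under assumptions: the gateway address and process id are placeholders, and the mapping of zeebe.client.job.timeout to the builder's default job timeout is my reading, not taken from the benchmark tool's code:

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;

public class TimeoutSettingsSketch {
  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500")              // placeholder address
        .usePlaintext()
        .defaultJobTimeout(Duration.ofMillis(3600000))  // assumed equivalent of zeebe.client.job.timeout
        .build()) {

      // Per-request override, mirroring the 5400 s requestTimeout used when creating instances.
      client.newCreateInstanceCommand()
          .bpmnProcessId("benchmark-process")           // hypothetical process id
          .latestVersion()
          .requestTimeout(Duration.ofSeconds(5400))
          .send()
          .join();
    }
  }
}
```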

None of these helped. One important observation is that these warnings only appear when the cluster configuration is large: I get this error when running with 6 or 7 brokers, where the PI/s go above 200, while I don’t get it when running with just 2 brokers, where the PI/s go up to 54. Wonder why?

Thanks.