KAFKA SINK CONNECTOR HIGH CPU UTILISATION

umesh_kushwaha · December 22, 2021, 7:27am

Hi,

I am running a load test with a simple BPMN process that has two tasks: a source task and a sink task. I am using the open-source Zeebe Kafka connector for the asynchronous communication model. Whenever the Zeebe client starts throwing RESOURCE_EXHAUSTED errors, the sink connector's CPU utilisation reaches 100%. I dug into the code and found that it retries recursively until it receives a success code or a failure code. The load test service creates instances at a rate of 7K RPM. I also observed that the backpressure limit went down to 5.
Please help me tune the performance. We are expecting a load of 100k RPM in production.

jwulf · January 11, 2022, 5:53am

Well, this is the nature of a distributed system. If the broker gets saturated, it replies with gRPC status code 8 (RESOURCE_EXHAUSTED) to signal backpressure to the clients.

How the client responds, and how it communicates backpressure through the system, are design considerations.

If you change the Kafka sink to do backoff retry, it may allow the broker to recover, rather than increasing the pressure.

But if the Kafka input does not slow down (if this is sustained load, rather than a burst), you will have a buffering issue in the Kafka sink, which can cause it to run out of memory - in which case data may be lost.

So solutions could be to write backoff-retry in the Kafka sink (buffering when broker is signalling backpressure) to deal with bursts, or to provision the broker cluster to deal with the highest level of load.
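
To make the backoff idea concrete, here is a minimal sketch of a capped exponential delay with random jitter. The `Backoff` class, its method names, and the base and cap values are hypothetical, chosen for illustration only; this is not code from kafka-connect-zeebe.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper for illustration; not part of kafka-connect-zeebe. */
public final class Backoff {
    private Backoff() {}

    /** Upper bound of the delay for an attempt: baseMillis * 2^attempt, capped at maxMillis. */
    public static long ceilingMillis(int attempt, long baseMillis, long maxMillis) {
        // Clamp the exponent so the multiplication stays within a long
        // for realistic base delays (milliseconds up to a few seconds).
        int shift = Math.min(Math.max(attempt, 0), 30);
        long uncapped = baseMillis * (1L << shift);
        return Math.min(uncapped, maxMillis);
    }

    /** Random delay in [0, ceiling], so that many clients do not retry in lockstep. */
    public static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        long ceiling = ceilingMillis(attempt, baseMillis, maxMillis);
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }
}
```

With a base of 100 ms and a cap of 5000 ms, the ceiling grows 100, 200, 400, 800, ... and stays at 5000 from the sixth retry onwards. This only smooths out bursts; under sustained overload the sink still accumulates unsent records.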

jwulf · January 11, 2022, 6:00am

Looks like the retry should back off on RESOURCE_EXHAUSTED here:

github.com/camunda-community-hub/kafka-connect-zeebe/blob/master/src/main/java/io/zeebe/kafka/connect/sink/ZeebeSinkFuture.java#L76

          (aVoid, throwable) -> {
            if (throwable == null) {
              this.complete(aVoid);
            } else if (throwable instanceof StatusRuntimeException) {
              // handle gRPC errors
              final StatusRuntimeException statusException = (StatusRuntimeException) throwable;
              final Code code = statusException.getStatus().getCode();
              if (SUCCESS_CODES.contains(code)) {
                complete(null);
              } else if (RETRIABLE_CODES.contains(code)) {
                executeAsync();
              } else if (FAILURE_CODES.contains(code)) {
                completeExceptionally(throwable);
              } else {
                LOGGER.warn("Unexpected gRPC status code {} received", code, throwable);
                completeExceptionally(throwable);
              }
            } else {
              completeExceptionally(throwable);
            }
          });
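
In the excerpt above, the retriable branch calls executeAsync() again immediately, which is what keeps the CPU busy while the gateway is still rejecting requests. A rough sketch of the alternative, scheduling the next attempt after a growing delay, could look like the following. It uses only JDK classes; `DelayedRetry`, its parameters, and the `retriable` predicate are hypothetical names, and it is not necessarily how the connector implements it.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical sketch; names are illustrative, not taken from kafka-connect-zeebe. */
public final class DelayedRetry {
    private DelayedRetry() {}

    /** Runs the call, retrying with a capped exponential delay while the error is retriable. */
    public static <T> CompletableFuture<T> run(
            Supplier<CompletableFuture<T>> call,
            Predicate<Throwable> retriable,
            ScheduledExecutorService scheduler,
            long baseMillis,
            long maxMillis) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(call, retriable, scheduler, baseMillis, maxMillis, 0, result);
        return result;
    }

    private static <T> void attempt(
            Supplier<CompletableFuture<T>> call,
            Predicate<Throwable> retriable,
            ScheduledExecutorService scheduler,
            long baseMillis,
            long maxMillis,
            int attemptNo,
            CompletableFuture<T> result) {
        try {
            call.get().whenComplete((value, error) -> {
                if (error == null) {
                    result.complete(value);
                } else if (retriable.test(error)) {
                    // Wait before retrying instead of re-sending at once:
                    // the delay doubles per attempt, up to maxMillis.
                    long delay = Math.min(baseMillis << Math.min(attemptNo, 20), maxMillis);
                    scheduler.schedule(
                        () -> attempt(call, retriable, scheduler,
                                      baseMillis, maxMillis, attemptNo + 1, result),
                        delay, TimeUnit.MILLISECONDS);
                } else {
                    result.completeExceptionally(error);
                }
            });
        } catch (RuntimeException e) {
            // The supplier itself threw; fail the result rather than leave it pending.
            result.completeExceptionally(e);
        }
    }
}
```

The scheduler would typically be a single shared instance, for example one created with Executors.newSingleThreadScheduledExecutor(). In the connector's case, the predicate would be the check for a retriable gRPC status code.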

jwulf · January 11, 2022, 6:03am

I opened a feature request in the Kafka connector [0]. Someone who is familiar with Java could implement this and PR it in.

[0] Feature request: Backoff the retry when gateway reports RESOURCE_EXHAUSTED · Issue #70 · camunda-community-hub/kafka-connect-zeebe · GitHub

BerndRuecker · January 11, 2022, 8:20am

Thanks for looking at this, Josh, and for creating the issue.

BerndRuecker · January 26, 2022, 8:35am

Looking at it now - see linked GitHub issue for details

jwulf · January 27, 2022, 10:11pm

Falko and Bernd implemented this: Add backoff to retrying by berndruecker · Pull Request #71 · camunda-community-hub/kafka-connect-zeebe · GitHub

umesh_kushwaha · January 28, 2022, 10:04am

Thanks for prioritising this. I will pull the change into my local repository, run the load test, and post the results here.