Delay in Zeebe Client and broker

Hi Team,
We have configured 1 broker and 1 gateway on a self-managed cluster with the following configuration.
The broker has a maximum heap of 2 GB; the gateway has a maximum heap of 1 GB.
Broker
ZEEBE_BROKER_GATEWAY_ENABLE=false
ZEEBE_BROKER_CLUSTER_NODEID=0
ZEEBE_BROKER_NETWORK_COMMANDAPI_PORT=26601
ZEEBE_BROKER_NETWORK_INTERNALAPI_PORT=26602
ZEEBE_BROKER_NETWORK_COMMANDAPI_HOST=
ZEEBE_BROKER_NETWORK_INTERNALAPI_HOST=1
ZEEBE_BROKER_CLUSTER_PARTITIONSCOUNT=1
ZEEBE_BROKER_CLUSTER_REPLICATIONFACTOR=1
ZEEBE_BROKER_CLUSTER_CLUSTERSIZE=1
ZEEBE_BROKER_CLUSTER_CLUSTERNAME=BPO-ZEEBE
ZEEBE_BROKER_EXPORTERS_ELASTICSEARCH_CLASSNAME=io.camunda.zeebe.exporter.ElasticsearchExporter
ZEEBE_BROKER_EXPORTERS_ELASTICSEARCH_ARGS_URL=<es_host>
ZEEBE_BROKER_EXPORTERS_ELASTICSEARCH_ARGS_BULK_DELAY=1
ZEEBE_BROKER_EXPORTERS_ELASTICSEARCH_ARGS_INDEX_PREFIX=bpo-zeebe-record
ZEEBE_BROKER_THREADS_CPUTHREADCOUNT=4
ZEEBE_BROKER_THREADS_IOTHREADCOUNT=4

Gateway
ZEEBE_STANDALONE_GATEWAY=true
ZEEBE_GATEWAY_CLUSTER_MEMBERID=gateway
ZEEBE_GATEWAY_NETWORK_PORT=26500
ZEEBE_GATEWAY_CLUSTER_PORT=26502
ZEEBE_GATEWAY_NETWORK_HOST=
ZEEBE_GATEWAY_CLUSTER_HOST=
ZEEBE_GATEWAY_THREADS_MANAGEMENTTHREADS=1
ZEEBE_GATEWAY_CLUSTER_INITIALCONTACTPOINTS=
ZEEBE_GATEWAY_CLUSTER_CLUSTERNAME=BPO-ZEEBE

On the Zeebe client side, we have 15 job execution threads (numJobWorkerExecutionThreads) and a maximum of 64 active jobs (jobWorkerMaxJobsActive).

We are seeing a delay in execution when we run a performance test with 20 concurrent users.
Under load with 20 concurrent threads, there is a delay of around 300 ms between the process instance being activated and the first service task being activated.

In the above workflow model:
The time from activating the process instance to activating the first service task is around 300 ms.
Our service task is lightweight and does no complex operations; executing it and issuing the complete command takes less than 100 ms. Yet the time between activating and completing the service task is around 300 ms, even though execution took no more than 100 ms.
Activating the next user task then takes another 100 ms.
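
As a rough sanity check on these numbers, the worker side can be modelled with simple arithmetic. The Python sketch below is a simplified model, not a description of how Zeebe actually schedules jobs: it assumes all 20 concurrent jobs become available at the same moment, that each job takes the full 100 ms, and that the 15 execution threads are the only limit. It ignores job polling, gateway round trips, and broker processing, and the function names are hypothetical.

```python
import math


def max_jobs_per_second(worker_threads: int, service_ms: int) -> float:
    """Upper bound on throughput if every thread is always busy."""
    return worker_threads * 1000 / service_ms


def worst_case_batch_latency_ms(concurrent_jobs: int, worker_threads: int, service_ms: int) -> int:
    """Time until the last job finishes when all jobs arrive at once.

    Jobs are handled in waves of `worker_threads`; each wave takes `service_ms`.
    """
    waves = math.ceil(concurrent_jobs / worker_threads)
    return waves * service_ms


# Figures from this post: 15 threads, 20 concurrent users, jobs of at most 100 ms.
print(max_jobs_per_second(15, 100))              # 150.0
print(worst_case_batch_latency_ms(20, 15, 100))  # 200
```

Under this model the last of 20 simultaneous jobs would finish after about 200 ms, so thread contention on the worker could explain part of the gap between activation and completion, but not necessarily all of it.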

This data was captured with the following Elasticsearch query:
http://:9200/bpo-zeebe-record-process-instance/_search?pretty=
{
"size": 200,
"sort": { "timestamp": "desc" },
"query": {
"bool": { "must": [ { "match": { "value.processInstanceKey": { "query": <process_instance_key> } } } ] }
}
}
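
For anyone reproducing these numbers, here is a minimal Python sketch of how the per-step delays could be derived from the hits such a query returns. It assumes each hit's `_source` carries `timestamp` (epoch milliseconds), `position`, `intent`, and `value.elementId`, as written by the Zeebe Elasticsearch exporter. The function names and the sample records are made up for illustration and are not real measurements. Note that it sorts ascending, unlike the query above, so the deltas read in execution order.

```python
from typing import Any


def build_query(process_instance_key: int, size: int = 200) -> dict[str, Any]:
    """Search body equivalent to the query above, but sorted oldest first."""
    return {
        "size": size,
        "sort": {"timestamp": "asc"},
        "query": {"bool": {"must": [
            {"match": {"value.processInstanceKey": {"query": process_instance_key}}}
        ]}},
    }


def lifecycle_deltas(hits: list[dict[str, Any]]) -> list[tuple[str, str, int]]:
    """Return (elementId, intent, ms since the previous record) per record."""
    records = sorted(
        (hit["_source"] for hit in hits),
        # position breaks ties between records that share a timestamp
        key=lambda r: (r["timestamp"], r.get("position", 0)),
    )
    deltas = []
    previous_ts = None
    for record in records:
        ts = record["timestamp"]
        delta = 0 if previous_ts is None else ts - previous_ts
        deltas.append((record["value"]["elementId"], record["intent"], delta))
        previous_ts = ts
    return deltas


# Hypothetical hits, shaped like the exporter's process-instance records.
sample_hits = [
    {"_source": {"timestamp": 1600, "position": 3, "intent": "ELEMENT_COMPLETED",
                 "value": {"elementId": "task_1"}}},
    {"_source": {"timestamp": 1000, "position": 1, "intent": "ELEMENT_ACTIVATED",
                 "value": {"elementId": "process"}}},
    {"_source": {"timestamp": 1300, "position": 2, "intent": "ELEMENT_ACTIVATED",
                 "value": {"elementId": "task_1"}}},
]

for element_id, intent, delta_ms in lifecycle_deltas(sample_hits):
    print(f"{element_id:10} {intent:20} +{delta_ms} ms")
```

The body returned by `build_query` can be sent to the same `_search` endpoint with any HTTP client, and the `hits.hits` array of the response passed to `lifecycle_deltas`.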

Please help us identify the ideal configuration for the broker, the gateway, and the client-side job workers under higher load.

We have tried different combinations of configuration on the broker and client side: increasing the partition count, increasing the number of brokers to 2 and 4, and running multiple gateways. None of these gave a significant improvement.

The cluster where the Zeebe components are deployed has 8 cores and 16 GB of memory.

The Zeebe components and the Zeebe client are deployed in the same region, so network latency is very low.

We are using Camunda 8.1.5.

Let me know if any other details are required.