Camunda 8 on Google Kubernetes Engine: Identity/Keycloak redirect loop

jack · November 11, 2024, 3:19pm

Continuing my journey to get Camunda 8 self-hosted running on Google Cloud with Google Kubernetes Engine, I have made a lot of progress, and most of the stuff, from pods to ingress, seems to be working now.

Unfortunately, I fail to log in when I open Tasklist or Operate, as the browser gets stuck in a redirect loop between Operate and Keycloak. I’m out of ideas about what might cause this.

I have set up a combined ingress in GKE, which now also uses a valid, non-self-signed certificate.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: camunda-ingress
  annotations:
    kubernetes.io/ingress.global-static-ip-name: "my-ip"
    ingress.gcp.kubernetes.io/pre-shared-cert: "my-cert"
spec:
  rules:
  - http:
      paths:
      - path: /operate
        pathType: Prefix
        backend:
          service:
            name: camunda-operate
            port:
              number: 80
      - path: /auth
        pathType: Prefix
        backend:
          service:
            name: camunda-keycloak
            port:
              number: 80

In my helm values file I configured the redirect URIs accordingly:

global: 
  identity:
    auth:
      publicIssuerUrl: "https://mydomain.com/auth/realms/camunda-platform"
      operate:
        redirectUrl: "https://mydomain.com/operate/operate"

My identityKeycloak section is also not much to write home about:

identityKeycloak:
  service:
    annotations:
      cloud.google.com/neg: '{"ingress": true}'
  readinessGates:
    - conditionType: "cloud.google.com/load-balancer-neg-ready"

I did not make any other adaptations to anything related to Identity or Keycloak.

When I open Operate, I get redirected correctly to Keycloak for the login page. I can log in, and the login is seemingly successful, as I get redirected afterwards and I can see that cookies from Keycloak are set. But then it goes into an infinite redirect loop until the browser detects it.

What could be causing this? I even tried putting Keycloak on a separate, non-combined ingress, as I read somewhere that a combined ingress might cause issues (although only with non-HTTPS), but it did not change anything.

I’m puzzled about where I could even start to look for the cause of this.

That’s how it looks in the browser’s network tab (for some reason the requests do not seem to be in actual order):

Can anybody point me in the right direction of what might be causing this?

Update: I now see errors in operate logs about not being authenticated when this happens:

io.camunda.webapps.controllers.WebappsRequestForwardManager - Requested path /operate/identity-callback, but not authenticated. Redirect to /api/login

Hmmm. But why. Login succeeds. Redirect happens. Is it the wrong redirect endpoint of Operate?

Alex_Voloshyn · November 11, 2024, 4:17pm

Hi @jack

redirectUrl: "https://mydomain.com/operate/operate"

“operate” specified two times, is it on purpose? Just looking for something suspicious…

Regards,
Alex

cpbpm · November 11, 2024, 4:30pm

if possible, please share the chart values you used for this deployment.

jack · November 11, 2024, 4:34pm

Hi @Alex_Voloshyn

Thanks for helping out again.

Yep that’s definitely wrong … it’s because the first /operate is my combined ingress path, while the second operate is the actual operate path of Operate itself. But looking at my local Compose setup, /identity-callback does seem to live under root and not under /operate …

But … I already suspected that and have now removed the second operate, and it still does not work. Operate logs show:

io.camunda.webapps.controllers.WebappsRequestForwardManager - Requested path /operate/identity-callback, but not authenticated. Redirect to /api/login

I wonder if the /operate in the path from the GKE Ingress path might still be an issue here.

jack · November 11, 2024, 4:37pm

Hi @cpbpm

Here’s the complete values.yml I’m using as of right now, just redacted the domain used:

global: 
  identity:
    auth:
      publicIssuerUrl: "https://my-domain.com/auth/realms/camunda-platform"
      operate:
        redirectUrl: "https://my-domain.com/operate"
      tasklist:
        redirectUrl: "https://my-domain.com/tasklist"
      optimize:
        redirectUrl: "https://my-domain.com/optimize"

zeebe:
  clusterSize: 2 # Reduce cluster size for Integration environment from 3 to 2 to save resources
  partitionCount: 2
  replicationFactor: 2
  resources:
    requests:
      cpu: "400m" # Reduce CPU requests for Integration environment by 50 % 

operate:
  resources:
    requests:
      cpu: "300m" # Reduce CPU requests for Integration environment by 50 % 
  service:
    annotations:
      cloud.google.com/neg: '{"ingress": true}' # Creates a NEG after an Ingress is created
      beta.cloud.google.com/backend-config: '{"default": "camunda-hc-backendconfig"}' # Attach the backend config to the service
  readinessGates:
    - conditionType: "cloud.google.com/load-balancer-neg-ready"  # GKE native ingress needs this gate for LB

tasklist:
  resources:
    requests:
      cpu: "200m" # Reduce CPU requests for Integration environment by 50 %
  service:
    annotations:
      cloud.google.com/neg: '{"ingress": true}' # Creates a NEG after an Ingress is created
      beta.cloud.google.com/backend-config: '{"default": "camunda-hc-backendconfig"}' # Attach the backend config to the service
  readinessGates:
    - conditionType: "cloud.google.com/load-balancer-neg-ready"  # GKE native ingress needs this gate for LB

identity:
  resources:
    requests:
      cpu: "300m" # Reduce CPU requests for Integration environment by 50 %
  service:
    annotations:
      cloud.google.com/neg: '{"ingress": true}' # Creates a NEG after an Ingress is created
      beta.cloud.google.com/backend-config: '{"default": "camunda-identity-backendconfig"}' # Attach the backend config to the service
  readinessGates:
    - conditionType: "cloud.google.com/load-balancer-neg-ready"  # GKE native ingress needs this gate for LB

identityKeycloak:
  service:
    annotations:
      cloud.google.com/neg: '{"ingress": true}' # Creates a NEG after an Ingress is created
  readinessGates:
    - conditionType: "cloud.google.com/load-balancer-neg-ready"  # GKE native ingress needs this gate for LB

elasticsearch:
  master:
    replicaCount: "1" # Reduce replica count for Integration environment from 3 to 1 to save resources
    resources:
      requests:
        cpu: "750m" # Reduce CPU requests for Integration environment by 250m

jack · November 11, 2024, 9:59pm

I finally found my mistake. I needed to set the contextPath of Operate to /operate, since that’s what I use for the combined ingress path.

operate:
  contextPath: "/operate"
  resources:
    requests:
      cpu: "300m" # Reduce CPU requests for Integration environment by 50 % 
  service:
    annotations:
      cloud.google.com/neg: '{"ingress": true}' # Creates a NEG after an Ingress is created
      beta.cloud.google.com/backend-config: '{"default": "camunda-operate-backendconfig"}' # Attach the backend config to the service
  readinessGates:
    - conditionType: "cloud.google.com/load-balancer-neg-ready"  # GKE native ingress needs this gate for LB

I somehow thought Operate (or the other webapps) would ignore the first ingress /operate route. That’s because I had to use /operate/operate to actually get to the login, which gave me the impression that this is how it works.
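
For anyone comparing their own setup, here is a minimal sketch of the three places where the path has to agree. It assumes the combined GKE ingress routes the prefix /operate to the Operate service and forwards the request path unchanged, as in my ingress above. The domain is a placeholder.

```yaml
# values.yaml fragment (sketch; my-domain.com is a placeholder)
global:
  identity:
    auth:
      operate:
        # 1) Public URL that Keycloak redirects back to after login
        redirectUrl: "https://my-domain.com/operate"

operate:
  # 2) Path that Operate itself serves under; must match the ingress prefix
  contextPath: "/operate"

# 3) The Ingress rule stays as it was:
#      - path: /operate
#        pathType: Prefix
```

The same pattern should apply to Tasklist with /tasklist, but treat that as an assumption until you have verified it in your own cluster.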

In hindsight, it’s obvious … but it’s kind of confusing sometimes to get off the beaten path with Camunda’s default ingress as I have to use GCP ingress that’s configured quite differently.

Well, for anyone who might come after me: if you set contextPath, be aware that this path will then also be applied to the health check endpoint on port 9600, so you have to update your GCP health checks to /[yourContextPath]/actuator/health/… so they won’t stop working.

atultewari · November 14, 2024, 12:25am

Hello Jack,
We’ve been struggling to get our Camunda 8.6 environment running in GKE using gke-ingress. Would you mind sharing all values.yaml you used for a successful deployment of Camunda with all components (identity, keycloak, tasklist, operate, optimize, zeebe) enabled?

The only two components that seem to be working using gke-ingress for us are:

/optimize
/auth

For tasklist and operate, we are experiencing this error when navigating to /tasklist and /operate

Error: Server Error
The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.

Any pointers will help? Your posts above were very helpful, but we aren’t able to get the tasklist and operate endpoints to work.

Thanks and best regards,
Atul

jack · November 14, 2024, 7:55pm

Hi @atultewari

The error you are seeing happened to me when the GKE load balancer thought the backend/service was unhealthy.

By default, the load balancer sends health checks to the serving port of the service, which in the case of Camunda is different from the port that serves the health check endpoints.

For auth, it’s the same port. That’s why it’s likely working for you.

You can’t reconfigure the port for the health checks in the GCP portal; it will always get reset to the serving port. You need to deploy a specific backend config to the cluster that configures the health check endpoint, for example for Operate:

# Default health checks do not allow another port than 80
# Camunda uses port 9600 for health checks, so we need to create a backend config to allow this
# This backend config needs to be applied to each service that needs this different health check path
# -> Annotation on the service: beta.cloud.google.com/backend-config: '{"default": "camunda-operate-backendconfig"}'
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: camunda-operate-backendconfig
spec:
  healthCheck:
    checkIntervalSec: 30
    timeoutSec: 10
    unhealthyThreshold: 2
    healthyThreshold: 1
    type: HTTP
    requestPath: /operate/actuator/health/readiness
    port: 9600 # Custom health check port

The /operate in the request path is only needed if you use combined ingress. You need this backend config for tasklist, operate, identity, and zeebe.
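
As a sketch of what the equivalent for Tasklist might look like, by analogy with the Operate config above: the name matches the annotation in my values.yaml, while the port and request path are assumptions (Tasklist serving its actuator on 9600 like Operate, with contextPath "/tasklist"), so please verify them against your own pods.

```yaml
# Sketch by analogy with the Operate backend config; not a verified config
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: camunda-tasklist-backendconfig
spec:
  healthCheck:
    checkIntervalSec: 30
    timeoutSec: 10
    unhealthyThreshold: 2
    healthyThreshold: 1
    type: HTTP
    requestPath: /tasklist/actuator/health/readiness # Assumes contextPath "/tasklist"
    port: 9600 # Assumption: same health check port as Operate
```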

Here’s also my values.yaml for reference, the backend config needs to be in the annotations of the service!

global:
  identity:
    auth:
      publicIssuerUrl: "https://***/auth/realms/camunda-platform"
      operate:
        redirectUrl: "https://***/operate"
      tasklist:
        redirectUrl: "https://***/tasklist"
      identity:
        redirectUrl: "https://***/identity"
      optimize:
        redirectUrl: "https://***/optimize"

zeebe:
  clusterSize: 2 # Reduce cluster size for Integration environment from 3 to 2 to save resources
  partitionCount: 2
  replicationFactor: 2
  resources:
    requests:
      cpu: "400m" # Reduce CPU requests for Integration environment by 50 %

zeebeGateway:
  service:
    annotations:
      cloud.google.com/app-protocols: '{"gateway":"HTTP2"}' # Specifies HTTP2 for gRPC
      beta.cloud.google.com/backend-config: '{"default": "camunda-zeebe-backendconfig"}' # Attach the backend config to the service
  extraVolumes:
    - name: zeebe-tls-cert
      secret:
        secretName: zeebe-tls-cert
  extraVolumeMounts:
    - name: zeebe-tls-cert
      mountPath: "/path/to/certs"
      readOnly: true

operate:
  contextPath: "/operate"
  resources:
    requests:
      cpu: "300m" # Reduce CPU requests for Integration environment by 50 %
  service:
    annotations:
      cloud.google.com/neg: '{"ingress": true}' # Creates a NEG after an Ingress is created
      beta.cloud.google.com/backend-config: '{"default": "camunda-operate-backendconfig"}' # Attach the backend config to the service
  readinessGates:
    - conditionType: "cloud.google.com/load-balancer-neg-ready" # GKE native ingress needs this gate for LB

tasklist:
  contextPath: "/tasklist"
  resources:
    requests:
      cpu: "200m" # Reduce CPU requests for Integration environment by 50 %
  service:
    annotations:
      cloud.google.com/neg: '{"ingress": true}' # Creates a NEG after an Ingress is created
      beta.cloud.google.com/backend-config: '{"default": "camunda-tasklist-backendconfig"}' # Attach the backend config to the service
  readinessGates:
    - conditionType: "cloud.google.com/load-balancer-neg-ready" # GKE native ingress needs this gate for LB

identity:
  contextPath: "/identity"
  fullURL: "https://***/identity"
  resources:
    requests:
      cpu: "300m" # Reduce CPU requests for Integration environment by 50 %
  service:
    annotations:
      cloud.google.com/neg: '{"ingress": true}' # Creates a NEG after an Ingress is created
      beta.cloud.google.com/backend-config: '{"default": "camunda-identity-backendconfig"}' # Attach the backend config to the service
  readinessGates:
    - conditionType: "cloud.google.com/load-balancer-neg-ready" # GKE native ingress needs this gate for LB

identityKeycloak:
  service:
    annotations:
      cloud.google.com/neg: '{"ingress": true}' # Creates a NEG after an Ingress is created
  readinessGates:
    - conditionType: "cloud.google.com/load-balancer-neg-ready" # GKE native ingress needs this gate for LB

elasticsearch:
  master:
    replicaCount: "1" # Reduce replica count for Integration environment from 3 to 1 to save resources
    resources:
      requests:
        cpu: "750m" # Reduce CPU requests for Integration environment by 250m

I was not able to get zeebe-gateway ingress working yet, though, due to issues enabling TLS on the service.

atultewari · November 14, 2024, 8:05pm

Thank you so much, Jack. We were able to get the health checks for /operate and /tasklist working, but the /identity health check is still failing. Once these are resolved, we will be working on the Zeebe TLS issue as well, so we would love to compare notes. Is your install using the context path /identity? And is the health check working for you?

For some reason, the health check for Identity is still failing, even though we confirmed that we can successfully check readiness at the service level at IP:82/identity/actuator/health. The service maps port 82 to 8082.

We are using a contextPath /identity. GCP’s Healthcheck FW rule includes port 82. Here are the values.yaml for identity:

Camunda

global:
  identity:
    auth:
      publicIssuerUrl: "https://dev-camunda.abcdefg.dev/auth/realms/camunda-platform"
      operate:
        redirectUrl: "https://dev-camunda.abcdefg.dev/operate"
      tasklist:
        redirectUrl: "https://dev-camunda.abcdefg.dev/tasklist"
      optimize:
        redirectUrl: "https://dev-camunda.abcdefg.dev/optimize"
      identity:
        redirectUrl: "https://dev-camunda.abcdefg.dev/identity"
  

identity:
  contextPath: "/identity"
  service:
    annotations:
      cloud.google.com/neg: '{"ingress": true}' # Creates a NEG after an Ingress is created
      cloud.google.com/backend-config: '{"default": "camunda-hc-identity"}' # Attach the backend config to the service

Ingress

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: camunda-ingress
  annotations:
    kubernetes.io/ingress.global-static-ip-name: "camunda-dev-ingress"
    kubernetes.io/ingress.class: "gce"
    networking.gke.io/managed-certificates: "dev-camunda"
spec:
  rules:
    - host: dev-camunda.abcdefg.dev
      http:
        paths:
        - path: /operate
          pathType: Prefix
          backend:
            # This assumes http-svc exists and routes to healthy endpoints
            service:
              name: camunda-operate
              port:
                number: 80
        - path: /tasklist
          pathType: Prefix
          backend:
            # This assumes http-svc exists and routes to healthy endpoints
            service:
              name: camunda-tasklist
              port:
                number: 80
        - path: /optimize
          pathType: Prefix
          backend:
            # This assumes http-svc exists and routes to healthy endpoints
            service:
              name: camunda-optimize
              port:
                number: 80
        - path: /auth
          pathType: Prefix
          backend:
            # This assumes http-svc exists and routes to healthy endpoints
            service:
              name: camunda-keycloak
              port:
                number: 80
        - path: /identity
          pathType: Prefix
          backend:
            # This assumes http-svc exists and routes to healthy endpoints
            service:
              name: camunda-identity
              port:
                number: 80

Any insight on why Identity is not working even though the probe works from within the cluster, from another pod to the service at :82/actuator/health?

Regards,
Atul

jack · November 14, 2024, 9:43pm

Here’s my backend config for Identity.

Remove the /identity from the requestPath even though you’re using this subroute. For some reason this seems to work differently in Identity.

apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: camunda-identity-backendconfig
spec:
  healthCheck:
    checkIntervalSec: 30
    timeoutSec: 10
    unhealthyThreshold: 2
    healthyThreshold: 1
    type: HTTP
    requestPath: /actuator/health
    port: 8082 # Custom health check port

Also don’t forget to set fullURL: "https://yourdomain.com/identity" for identity. I don’t think it’s relevant for the health check, but it is relevant so that authentication redirects are working correctly later.

atultewari · November 14, 2024, 10:13pm

Finally figured out the health check for Identity. But whenever I navigate to https://domain.com/identity, it gets redirected to http://localhost:8080/auth

What am I doing wrong? I have this

identity:
  contextPath: "/identity"
  service:
    annotations:
      cloud.google.com/neg: '{"ingress": true}' # Creates a NEG after an Ingress is created
      cloud.google.com/backend-config: '{"default": "camunda-hc-identity"}'

and an ingress

https://domain.com/identity to http://k8s-identity-svc:80/

Thoughts?

jack · November 15, 2024, 8:40am

You need to add this:

identity:
  contextPath: "/identity"
  fullURL: "https://yourdomain.com/identity"

atultewari · November 15, 2024, 1:41pm

Thank you, Jack. The missing fullURL was a surprisingly strange fix for Identity. Thank you so much again.

BTW, we are working on Zeebe. Apparently, we need to use ingress-nginx, and install it using:

helm upgrade --install  ingress-nginx ingress-nginx --repo https://kubernetes.github.io/ingress-nginx --set controller.service.loadBalancerIP=###.##.##.### --namespace ingress-nginx --create-namespace

Unfortunately, we may need to use 2 separate ingresses. gce-ingress does not seem to work.
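
A rough sketch of what that second ingress might look like, assuming ingress-nginx with its gRPC backend annotation and a separate hostname for Zeebe. The ingress name, host, TLS secret name, and service name are placeholders or assumptions, and we have not verified this yet.

```yaml
# Untested sketch of a separate ingress-nginx Ingress for Zeebe gRPC
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: camunda-zeebe-grpc # Hypothetical name
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC" # Talk gRPC to the backend
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - zeebe.dev-camunda.abcdefg.dev # Hypothetical host
      secretName: zeebe-grpc-tls # Hypothetical TLS secret
  rules:
    - host: zeebe.dev-camunda.abcdefg.dev
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: camunda-zeebe-gateway # Assumed service name for a release called "camunda"
                port:
                  number: 26500 # Zeebe gateway gRPC port
```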

We have not been able to connect to Zeebe yet, but as soon as our IT team has worked through changing the DNS, I’ll confirm here.

Have you had luck with connecting to Zeebe through the modeler?

Regards,
Atul

jack · November 15, 2024, 2:40pm

I have a question open too about that here:

I really do not want to have a second ingress set up here; the infra is complex enough already. I might opt for not exposing the Zeebe gRPC endpoint over Ingress at all, as it’s something we would normally only access from inside the cluster anyhow. For deploying from Modeler, I’d use port forwarding over kubectl.

But I have not given up yet on setting it up over GKE Ingress, because the traffic is reaching my Zeebe-Gateway just fine as long as HTTP/2 is enabled for the backend. The only issue is that Zeebe-Gateway itself does not enable TLS, but GKE Ingress will only use TLS to communicate with HTTP/2 backends and cannot be forced to use an unencrypted connection. So the connection fails, as Zeebe expects non-TLS traffic. I can see the failure in the Zeebe-Gateway logs, so the only remaining issue is Zeebe-Gateway itself.

There seems to be no way to force Zeebe into TLS via the Helm chart, except when using Camunda’s ingress settings. But there is documentation that shows it can be set via environment variables too, so I might try that next. For now I just go with what I have, as exposing Zeebe gRPC over Ingress is only a convenience, not a necessity.
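
What I have in mind is roughly the following. This is an untested sketch: it assumes the chart passes zeebeGateway.env through to the container, and that the zeebe-tls-cert secret mounted under /path/to/certs (see my values.yaml above) uses the standard tls.crt and tls.key file names.

```yaml
# Untested sketch: enable TLS on the Zeebe gateway via environment variables
zeebeGateway:
  env:
    - name: ZEEBE_GATEWAY_SECURITY_ENABLED
      value: "true"
    - name: ZEEBE_GATEWAY_SECURITY_CERTIFICATECHAINPATH
      value: "/path/to/certs/tls.crt" # Assumes the secret contains tls.crt
    - name: ZEEBE_GATEWAY_SECURITY_PRIVATEKEYPATH
      value: "/path/to/certs/tls.key" # Assumes the secret contains tls.key
```

If this works, clients inside the cluster would have to connect with TLS as well, so it is not a change without side effects.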

As for your question on Modeler, you need to authenticate with Identity on. And it seems Modeler does not support TLS either, so that could be an issue. I was able to connect to Zeebe via Modeler by using a non-TLS connection over port forwarding, together with authentication via ClientId/ClientSecret.

system · November 22, 2024, 2:40pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.