Migrating from Community ingress-nginx to F5 NGINX Ingress Controller Across 3 AKS Clusters
Earlier this month I migrated three production AKS clusters off the community ingress-nginx controller and onto the F5 NGINX Ingress Controller OSS (v2.5.1). The three workloads were a compliance API service, a real-time WebSocket trading server, and a charting frontend. Same controller name, completely different internals — and enough sharp edges to fill a post.
This is the full account: what changed, what broke, and the patterns I standardised across all three.
Why Migrate
The community Helm chart (kubernetes/ingress-nginx) and the F5 chart (nginx-stable/nginx-ingress) both proxy traffic through NGINX, but they diverge at almost every other layer — Helm structure, annotation prefixes, config key names, metrics port, and label selectors. F5 NGINX IC is the upstream-maintained version aligned with NGINX OSS releases and gives tighter control over the NGINX config without relying on the community's annotation translation layer.
The practical trigger was a mix of factors: the community chart had accumulated workarounds for bugs we no longer needed, the annotation surface was getting hard to audit, and we wanted a single, consistent ingress stack across clusters.
What Stayed the Same
Before diving into the diffs, here is what did not change:
- IngressClass name remains nginx in every cluster (no application-level changes needed)
- Azure Load Balancer type (internal where it was internal, public where public)
- cert-manager ClusterIssuers (one field rename, covered below)
- Linkerd injection on controller pods
The Migration Playbook
Every cluster followed the same five-step pipeline:
```shell
# 1. Pull the F5 chart via OCI — no helm repo add needed
helm pull oci://ghcr.io/nginx/charts/nginx-ingress \
  --version 2.5.1 \
  --destination /tmp/charts/

# 2. Verify checksum before touching anything
echo "23c866c0531719586570435a4d9a57ac0fb9661fdafd572c8916208cb7b4f225 /tmp/charts/nginx-ingress-2.5.1.tgz" \
  | sha256sum --check

# 3. One-time IngressClass migration guard
CONTROLLER=$(kubectl get ingressclass nginx \
  -o jsonpath='{.spec.controller}' 2>/dev/null || true)
if [ "${CONTROLLER}" = "k8s.io/ingress-nginx" ]; then
  echo "Removing community IngressClass — allowing F5 takeover"
  kubectl delete ingressclass nginx
fi

# 4. Helm upgrade
helm upgrade --install nginx-ingress /tmp/charts/nginx-ingress-2.5.1.tgz \
  --namespace nginx-ingress \
  -f values.yaml \
  --wait --timeout 5m

# 5. Verify the right controller is running
kubectl get pods -l app.kubernetes.io/name=nginx-ingress -n nginx-ingress
```
Step 3 deserves its own section.
The IngressClass Immutability Trap
spec.controller on an IngressClass resource is immutable after creation. The community controller sets it to k8s.io/ingress-nginx; the F5 controller expects nginx.org/ingress-controller. If you just run helm upgrade, F5 will fail to adopt the existing IngressClass and create a conflicting one — or worse, silently ignore it and not process any Ingress resources.
The solution is to delete the IngressClass before the first F5 install. But a naive unconditional delete is dangerous in an idempotent pipeline — if someone reruns the pipeline after migration, they'd delete the already-correct F5-owned IngressClass mid-flight, causing a brief outage.
The guard condition solves this:
```shell
if [ "${CONTROLLER}" = "k8s.io/ingress-nginx" ]; then
  kubectl delete ingressclass nginx
fi
```
After the first successful F5 install, spec.controller reads nginx.org/ingress-controller, so every subsequent pipeline run skips the delete. One-time, idempotent, safe.
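The one-time, idempotent behaviour can be sanity-checked without a cluster. Below is a toy sketch where `get_controller` stands in for the kubectl jsonpath lookup and a counter stands in for the delete; both names are invented for the illustration:

```shell
# Toy model of the guard. get_controller stubs the kubectl jsonpath lookup;
# delete_count records how many times the delete branch would fire.
delete_count=0
guard() {
  CONTROLLER=$(get_controller)
  if [ "${CONTROLLER}" = "k8s.io/ingress-nginx" ]; then
    delete_count=$((delete_count + 1))  # stands in for: kubectl delete ingressclass nginx
  fi
}

# First pipeline run: community controller still owns the class.
get_controller() { echo "k8s.io/ingress-nginx"; }
guard

# Every rerun after F5 adoption: the guard is a no-op.
get_controller() { echo "nginx.org/ingress-controller"; }
guard; guard

echo "$delete_count"  # prints 1 — the delete fired exactly once
```

However many times the pipeline reruns, the delete branch is reachable only while the community controller string is present, which is exactly the property the outage scenario depends on.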
Helm Values: Structural Differences
The community chart uses a flat controller.config map. F5 nests everything under controller.config.entries. Small diff, big gotcha if you copy-paste.
Community:
```yaml
controller:
  config:
    proxy-read-timeout: "600"
    load-balance: "ewma"
    use-gzip: "true"
```
F5:
```yaml
controller:
  config:
    entries:
      proxy-read-timeout: "600s"  # note: F5 expects the unit suffix
      lb-method: "ewma"           # key renamed
      # use-gzip has no equivalent — moved to http-snippets
```
A number of community config keys simply do not exist in F5 and are silently ignored if you leave them in. I audited every key against the F5 config documentation and removed: allow-snippet-annotations, allow-backend-server-header, block-user-agents, enable-vts-status, generate-request-id, limit-req-status-code, use-forwarded-headers, use-geoip, upstream-keepalive-*.
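That audit is mechanical enough to script. A minimal sketch with comm(1), assuming you have extracted your community config keys and the F5-supported keys into two files; both lists here are abbreviated, illustrative examples, not the full key sets:

```shell
# Illustrative key audit: which community config keys have no F5 counterpart?
# comm(1) requires sorted input, hence the sort on each heredoc.
sort > /tmp/community-keys.txt <<'EOF'
proxy-read-timeout
use-gzip
load-balance
use-forwarded-headers
EOF

sort > /tmp/f5-keys.txt <<'EOF'
proxy-read-timeout
keepalive-timeout
lb-method
EOF

# Lines only in the community list: delete these keys or port them to snippets.
comm -23 /tmp/community-keys.txt /tmp/f5-keys.txt
```

With these sample lists the output is load-balance, use-forwarded-headers, and use-gzip, i.e. exactly the keys that would otherwise be silently ignored after the cutover.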
Other keys that F5 does support but with different names:
| Community key | F5 equivalent |
|---|---|
| load-balance | lb-method |
| proxy-read-timeout | proxy-read-timeout + unit suffix |
| client-header-timeout | Move to http-snippets |
The full base controller config across all three clusters:
```yaml
controller:
  kind: deployment
  enableCustomResources: false  # not using VirtualServer CRDs
  enableSnippets: true
  telemetryReporting:
    enable: false  # no outbound access to oss.edge.df.f5.com
  ingressClass:
    name: nginx
    create: true
    setAsDefaultIngress: false
  service:
    annotations:
      service.beta.kubernetes.io/azure-load-balancer-health-probe-protocol: tcp
  metrics:
    enable: true
    port: 9113  # changed from community's default
  serviceMonitor:
    create: false
```
Three settings that tripped things up before I caught them:
telemetryReporting.enable: false — F5 attempts to phone home to oss.edge.df.f5.com. In a cluster with no outbound internet on the node pool, this causes the controller pod to crash-loop on startup waiting for the connection to time out. Must be disabled explicitly.
enableCustomResources: false — F5 ships its own CRDs (VirtualServer, TransportServer, Policy). If you leave this enabled and those CRDs aren't pre-installed, the controller crashes. Since all three clusters use standard Kubernetes Ingress resources, I disabled them entirely.
Azure LB health probe — The community controller serves /healthz on port 80. F5 does not. Azure's default HTTP probe on that path will mark all backends unhealthy. Switch to TCP probe.
Rate Limiting: From Annotations to NGINX Snippets
Community ingress-nginx ships first-class annotations for rate limiting:
```yaml
# community — applied as ingress annotations
nginx.ingress.kubernetes.io/limit-rpm: "120"
nginx.ingress.kubernetes.io/limit-connections: "60"
# status code for rejected requests comes from the limit-req-status-code ConfigMap key
```
F5 NGINX IC does not have equivalent annotation primitives. The correct F5 approach is to declare the rate limit zones globally in http-snippets (controller values) and apply them per-ingress via server-snippets.
Controller values — shared zones:
```yaml
controller:
  config:
    entries:
      http-snippets: |
        geo $app_limit_bypass {
          default 0;
          <office-cidr-1> 1;
          <office-cidr-2> 1;
        }
        map $app_limit_bypass $app_limit_key {
          0 $binary_remote_addr;
          1 "";
        }
        limit_req_zone $app_limit_key zone=app_rpm:10m rate=120r/m;
        limit_conn_zone $app_limit_key zone=app_conn:10m;
```
Ingress manifest — apply per route:
```yaml
annotations:
  nginx.org/server-snippets: |
    limit_req zone=app_rpm burst=80 nodelay;
    limit_req_status 429;
    limit_conn app_conn 60;
    limit_conn_status 429;
```
The geo+map pattern lets specific IP ranges (office networks, CI runners, load testing hosts) bypass rate limits by mapping to an empty key — which limit_req_zone treats as unlimited. This is cleaner than maintaining allow-lists in multiple annotation blocks across ingress manifests.
WebSocket Service: Keepalive Surprises
One of the services is a Socket.io server that holds long-lived WebSocket connections. Everything looked healthy post-migration — pods up, ingress adopted — but Socket.io clients started disconnecting every 30–60 seconds.
The root cause: F5's default keepalive-timeout is 0s (disabled), whereas the community chart defaults to 60s. WebSocket connections through NGINX depend on TCP keepalive to stay alive during idle periods. With keepalive disabled, NGINX was closing the connection server-side.
Fix:
```yaml
controller:
  config:
    entries:
      keepalive-timeout: "60s"
      http2: "false"  # HTTP/2 and WebSocket upgrades conflict; disable explicitly
```
Also required adding the F5 WebSocket annotation to the ingress manifest:
```yaml
annotations:
  nginx.org/websocket-services: "my-websocket-service"
```
Without this annotation, F5 does not set the necessary Upgrade and Connection proxy headers for WebSocket handshakes. The community controller handled this automatically; F5 requires you to be explicit.
Zero-Downtime Service Selector Patch
One cluster runs a secondary Service that routes specific traffic, and its label selector was hardcoded to the community controller labels:
```
app.kubernetes.io/name=ingress-nginx
app.kubernetes.io/component=controller
```
F5 uses app.kubernetes.io/name=nginx-ingress. After migration, the service selector matched nothing — endpoints went empty, traffic dropped.
Unlike Deployment selectors, Service selectors are mutable, so no recreation is needed — but re-applying the old manifest would simply reassert the stale selector. Instead, I patched the Service as a pre-upgrade pipeline step:
```shell
kubectl patch service <legacy-service-name> \
  -n nginx-ingress \
  --type='merge' \
  -p '{
    "spec": {
      "selector": {
        "app.kubernetes.io/name": "nginx-ingress"
      }
    }
  }'
```
The --type='merge' strategy replaces only the specified keys, leaving the rest of the selector intact. Running this before helm upgrade means the service selector matches the new pods the moment they come up.
The broader lesson: grep for ingress-nginx in all Service selectors across your cluster before starting the migration. Any service with a hardcoded community label selector will silently drop traffic after cutover.
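The grep is trivially scriptable against rendered manifests. A self-contained sketch, where manifests/ and the legacy-svc.yaml file are invented stand-ins for whatever your charts render:

```shell
# Hypothetical pre-flight check: scan rendered manifests for Services that
# still select on the community controller labels. The manifests/ directory
# and its contents are fabricated here for demonstration.
mkdir -p manifests
cat > manifests/legacy-svc.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: legacy-routing
spec:
  selector:
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/component: controller
EOF

# Any file listed here owns a Service whose endpoints go empty after cutover.
grep -rln 'app.kubernetes.io/name: ingress-nginx' manifests/
```

Against a live cluster the same idea works by piping `kubectl get svc -A -o yaml` through the grep instead of a manifests directory.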
cert-manager
One field rename in the ClusterIssuer template — class is deprecated in favour of ingressClassName:
```yaml
# before
solvers:
  - http01:
      ingress:
        class: nginx

# after
solvers:
  - http01:
      ingress:
        ingressClassName: nginx
```
Also removed a cert-manager feature gate that was only needed to work around a community ingress-nginx bug (issue #11176) related to path type handling. F5 does not have the bug:
```yaml
# removed from cert-manager values
featureGates: "ACMEHTTP01IngressPathTypeExact=false"
```
Datadog Metrics
F5 exposes Prometheus metrics on port 9113 (the community controller used 8080). The existing Datadog auto-discovery config was pointing at the wrong port. I added an OpenMetrics check:
```yaml
# datadog-agent values.yaml
confd:
  openmetrics.yaml: |-
    ad_identifiers:
      - nginx-ingress
    init_config:
    instances:
      - openmetrics_endpoint: "http://%%host%%:9113/metrics"
        namespace: nginx_ingress
        metrics:
          - nginx_connections_accepted
          - nginx_connections_active
          - nginx_connections_handled
          - nginx_http_requests_total
          - nginx_ingress_controller_ingress_resources_total
          - nginx_ingress_controller_nginx_reloads_total
          - nginx_ingress_controller_nginx_reload_errors_total
          - nginx_ingress_controller_nginx_last_reload_milliseconds
```
Two things to watch: the file must be named openmetrics.yaml (not nginx-ingress.yaml) for Datadog's catalog to recognise it, and ad_identifiers must match the container name nginx-ingress exactly.
Node Selector Key Update
The community chart uses the deprecated node label key:
```
beta.kubernetes.io/os=linux
```
F5 values use the stable GA key:
```
kubernetes.io/os=linux
```
Newer AKS node images no longer carry beta.kubernetes.io/os. If your node pool has dropped it, community controller pods won't schedule. Not migration-specific, but worth cleaning up in the same PR.
Helm Upgrade Stability
On cold nodes (newly scaled-up node pool), the F5 controller image pull can take longer than Helm's default 3m timeout. --wait --timeout 5m prevents spurious pipeline failures that previously looked like deployment regressions:
```shell
helm upgrade --install nginx-ingress ./nginx-ingress-2.5.1.tgz \
  --namespace nginx-ingress \
  -f values.yaml \
  --wait --timeout 5m
```
Rollout Issues Timeline
| Time | Issue | Fix |
|---|---|---|
| T+0 | F5 crash-loops on startup | telemetryReporting.enable: false + enableCustomResources: false |
| T+0 | Linkerd not injecting controller pods | Fixed annotation path: podAnnotations → controller.pod.annotations |
| T+0 | Datadog scraping wrong port | Added OpenMetrics check on port 9113 |
| T+0 | Datadog system-probe seccomp failures | systemProbe.enabled: false, discovery.enabled: false |
| T+1h | All LB backends unhealthy | Switched Azure LB probe from HTTP /healthz to TCP |
| T+2h | Socket.io client disconnections | keepalive-timeout: 60s, nginx.org/websocket-services annotation |
| T+3h | Secondary service endpoints empty | Pre-upgrade service selector patch |
| T+24h | Helm timeout on cold nodes | --wait --timeout 5m |
| T+10d | IngressClass delete too aggressive in pipeline reruns | Made delete conditional on spec.controller value |
The conditional IngressClass delete came last because the unconditional delete worked fine on the first run — the rerun risk only became apparent during a pipeline review afterward.
Key Differences Cheat Sheet
| Area | Community ingress-nginx | F5 NGINX IC |
|---|---|---|
| Helm source | kubernetes.github.io/ingress-nginx | OCI: ghcr.io/nginx/charts/nginx-ingress |
| Chart name | ingress-nginx | nginx-ingress |
| Config structure | controller.config flat map | controller.config.entries |
| Rate limiting | Annotations (nginx.ingress.kubernetes.io/*) | http-snippets + server-snippets |
| WebSocket | Automatic | nginx.org/websocket-services required |
| Metrics port | 8080 | 9113 |
| Pod labels | app.kubernetes.io/name=ingress-nginx | app.kubernetes.io/name=nginx-ingress |
| IngressClass controller field | k8s.io/ingress-nginx | nginx.org/ingress-controller |
| Linkerd annotation path | podAnnotations | controller.pod.annotations |
| Node selector key | beta.kubernetes.io/os | kubernetes.io/os |
| Telemetry | Off by default | Must disable explicitly |
| Custom resources | Not applicable | Must disable if not using |
| LB health probe | HTTP /healthz | TCP only |
What I Would Do Differently
Audit every config key before migrating. F5 silently ignores unknown config keys. A pre-migration diff against the F5 config reference would have caught the upstream-keepalive-* and use-gzip removals before they hit production.
Test WebSocket apps on a staging cluster first. The keepalive timeout issue was predictable — the default changed between controllers and I didn't check.
Grep for ingress-nginx in all Service selectors before starting. Any hardcoded community label selector silently drops traffic after cutover. Add the selector patch to your playbook as a standard pre-upgrade step, not a reactive fix.
The migration is complete and stable across all three clusters. Ingress configurations are now easier to reason about — NGINX config is NGINX config, not a translation layer of annotations into nginx.conf directives you can't see. If you're running the community chart and considering the switch, the above should give you a realistic picture of what to budget for.