Common Gotchas When Working with HPAs in Kubernetes

The Horizontal Pod Autoscaler looks simple. Here’s where it bites.

The Horizontal Pod Autoscaler — or HPA — looks like one of Kubernetes’ friendliest features. A Deployment, a target metric, a min, a max — a dozen lines of YAML and replica counts become someone else’s problem. Set it and forget it.

The trouble is that the HPA is very good at appearing to work. When it’s misconfigured, it rarely throws an error. It just quietly does nothing, or does something subtly wrong, while you refresh a dashboard trying to figure out why the pod count won’t budge. These are the traps that catch people most often — myself included.

First off — how does the HPA actually decide anything?

Before the gotchas, it’s worth demystifying what the HPA actually is, because it’s really nothing fancy. It’s a control loop inside the kube-controller-manager that wakes up every 15 seconds by default, fetches a metric, and runs one line of math:

desiredReplicas = ceil( currentReplicas × currentMetricValue / targetMetricValue )

That’s the whole show. If you’re running 4 replicas averaging 90% CPU utilization against a 60% target, it computes ceil(4 × 90/60) = 6 and sets your Deployment’s scale to 6. There’s an entire metrics pipeline underneath — the resource metrics API, custom metrics, external metrics — that’s beyond the scope of this article, but here’s the thing to internalize: every gotcha below is just one of the inputs to that formula going sideways.
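If it helps to see that as code, here is a minimal sketch of the same arithmetic in Python. The function name is mine, not something from the Kubernetes codebase, and it deliberately skips the refinements the real controller layers on top (tolerance, unready pods, min/max clamping).

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Core HPA formula: scale the replica count by the metric ratio, rounding up."""
    return math.ceil(current_replicas * current_metric / target_metric)


# 4 replicas averaging 90% CPU against a 60% target.
print(desired_replicas(4, 90, 60))  # 6
```

Rounding up is the important design choice: when the math lands between two replica counts, the controller errs toward more capacity rather than less.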

Gotcha #1: No metrics, no autoscaler

If you’ve ever created an HPA and seen this, you’ve hit the most common gotcha of them all:

$ kubectl get hpa
NAME   REFERENCE        TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
web    Deployment/web   <unknown>/70%   2         20        3          6m

The HPA doesn’t collect metrics itself. It asks the resource metrics API, and that API is served by metrics-server, which is an addon — plenty of clusters don’t ship it by default, with kubeadm-built clusters and stock EKS being two big offenders.

No metrics-server, no metrics. No metrics, no scaling. Without it, the HPA will report <unknown> and quietly do nothing forever.

The quick check is kubectl top pods, since its output is populated by the metrics-server API as well. If that comes back with error: Metrics API not available, you’ve found your problem. On production clusters, I would highly recommend installing metrics-server systematically, precisely because so much downstream tooling silently depends on it being there.

One common option is to install it alongside your other cluster add-ons like ingress-nginx, cert-manager, and external-secrets in an ArgoCD ApplicationSet.

One caveat if you’re running a homelab: metrics-server verifies kubelet serving certificates, and if yours are self-signed it’ll crash-loop right out of the box. The --kubelet-insecure-tls flag is the usual escape hatch there — fine at home, not something you want anywhere near production.

Gotcha #2: “Utilization” means percent of requests

$ kubectl get hpa
NAME   REFERENCE        TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
web    Deployment/web   <unknown>/70%   2         20        3          6m

Here’s a question worth asking out loud: 70% utilization of what, exactly? Not the node. Not the limit. The HPA calculates utilization as a percentage of the container’s resource request.

If your container requests 100m of CPU and your target is 70%, the HPA starts adding pods once average usage crosses 70m — even if the limit is a full core and the node is bored stiff.

That’s because requests are the only stable, declared baseline the HPA has to measure against. Which leads to a rule worth stating as bluntly as possible: if the resource you’re scaling on doesn’t have a request set, no scaling happens on that metric. Period.

The HPA can’t take a percentage of a number that doesn’t exist, so it reports <unknown> and does nothing. It’s worth noting that the rule is per-resource: CPU requests don’t help a memory-based HPA one bit. Scale on memory Utilization and every container needs a memory request, even if your CPU requests are immaculate. The failure shows up waiting for you in kubectl describe hpa:

Warning  FailedGetResourceMetric  horizontal-pod-autoscaler
failed to get cpu utilization: missing request for cpu

The sneakier version of the same rule — and this is the one that really gets people — is that every container in the pod needs the request, sidecars included, because the resource metric is computed across the whole pod by default. Inject a sidecar without a CPU request and a previously healthy HPA stops dead.
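To make the "percent of requests" rule concrete, here is a rough Python sketch of how I understand the calculation: total usage divided by total requests across every container in every pod. The data shapes and the function name are invented for illustration, and the real controller does more bookkeeping than this.

```python
from typing import List, Optional, Tuple

# Each container is (usage_millicores, request_millicores).
# A request of None stands for "no CPU request set". Hypothetical shape.
Container = Tuple[int, Optional[int]]
Pod = List[Container]


def cpu_utilization_percent(pods: List[Pod]) -> float:
    """Total usage over total requests, as a percentage."""
    total_usage = 0
    total_request = 0
    for pod in pods:
        for usage, request in pod:
            if request is None:
                # One container without a request blocks the whole metric.
                raise ValueError("missing request for cpu")
            total_usage += usage
            total_request += request
    return 100 * total_usage / total_request


# Two single-container pods, each using 70m against a 100m request.
print(cpu_utilization_percent([[(70, 100)], [(70, 100)]]))  # 70.0
```

Pass it a pod with an injected sidecar that has no request, such as [[(70, 100), (5, None)]], and the sketch raises instead of returning a number, which is the code-shaped version of an HPA going to <unknown>.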

If you’re in that boat, versions 1.30+ of Kubernetes give you a stable ContainerResource metric type that targets a single named container instead of the pod as a whole, which is a much cleaner way out than chasing down requests for every injected container.

The other escape hatch: this whole gotcha only bites Utilization targets, since they’re percentages of requests. Target an absolute number with type: AverageValue instead (e.g. averageValue: 500Mi) and there’s nothing to divide by, so no requests are required.

The flip side of this gotcha is subtler: because utilization is relative to requests, your requests basically are your scaling policy. Set them too low and you’ll scale out long before the hardware needs it; set them too high and the HPA may never wake up at all.

Gotcha #3: It’s an average — and there’s a dead band

The HPA doesn’t look at your hottest pod. It averages the metric across all the pods behind the target. Think of it like a thermostat that averages the temperature across every room in the house: one room can be on fire while the rest are chilly, and the furnace never kicks on. Ten pods where one is pegged at 100% and nine idle at 10% average out to 19%, well under any sane target, so the HPA adds nothing. If anything, it reads that average as a reason to scale down.
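Here is that ten-pod example as a small Python sketch, assuming a 70% target. The helper name is made up for the illustration.

```python
import math
from typing import List


def average_utilization(per_pod_percent: List[float]) -> float:
    """Plain arithmetic mean across pods, which is all the HPA sees."""
    return sum(per_pod_percent) / len(per_pod_percent)


pods = [100] + [10] * 9  # one pegged pod, nine idle ones
avg = average_utilization(pods)

print(avg)                                # 19.0
print(math.ceil(len(pods) * avg / 70))    # 3
```

The second number is the replica count the core formula would ask for. The hot pod vanishes into the average, and the math points toward fewer replicas rather than more.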

If sticky sessions or long-lived gRPC connections are piling load onto a couple of pods, the HPA is the wrong tool for the job; the real problem is load distribution.

And even when the average does creep past your target, there’s a tolerance band. By default the controller does nothing while the ratio of current to target stays within 10% of 1.0, so with a 70% target, an average of 75% won’t move the needle (75/70 ≈ 1.07, inside the band). If you’ve ever wondered why your HPA is sitting calmly at slightly-over-target, that’s why.

The band is a cluster-wide kube-controller-manager flag (--horizontal-pod-autoscaler-tolerance), which you almost certainly can’t touch on a managed cluster, so just design your targets knowing the trigger is fuzzy, not exact.
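A minimal sketch of the dead band in Python, assuming the default tolerance of 0.1. The function name is hypothetical, and the real controller handles more edge cases than this.

```python
import math


def desired_with_tolerance(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,
) -> int:
    """Apply the core formula only when the ratio leaves the tolerance band."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # inside the dead band: leave things alone
    return math.ceil(current_replicas * ratio)


print(desired_with_tolerance(4, 75, 70))  # 4 (ratio is about 1.07, inside the band)
print(desired_with_tolerance(4, 78, 70))  # 5 (ratio is about 1.11, outside the band)
```

With a 70% target, the effective trigger for scaling up sits a little above 77%, not at 70%.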

Gotcha #4: Your GitOps tool is fighting your HPA

This one is near and dear to my heart, since I run everything through ArgoCD. If your Deployment manifest sets spec.replicas and an HPA manages the same Deployment, you now have two controllers writing to one field. The HPA scales you to 6, the next sync sees drift and puts you back to 3 — or at minimum, ArgoCD marks the app perpetually OutOfSync until someone eventually “fixes” it. Every sync becomes a tiny, pointless scale event.

The right move is to remove spec.replicas from the manifest entirely and let the HPA own it. But there’s a nasty sub-gotcha buried in the migration: if the field was previously applied and you delete it from the YAML, a client-side kubectl apply drops the field, and the API server helpfully defaults it to 1. Your ten-replica service momentarily becomes a one-replica service until the HPA claws it back.

Server-side apply handles the ownership handoff gracefully, which is one of several good reasons to switch it on.

For ArgoCD specifically, either stop rendering replicas when autoscaling is enabled — the standard Helm chart pattern of wrapping it in {{- if not .Values.autoscaling.enabled }} does exactly this — or tell ArgoCD the field isn’t its business:

ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
      - /spec/replicas

Gotcha #5: Scale-down isn’t broken — it’s stabilized

Load spikes, the HPA scales you to 12, load drops, and… the pods just sit there. Five minutes later they finally start draining. Nothing is broken. By default, scale-down uses a 300-second stabilization window: the controller looks at every desired replica count it computed over the last five minutes and acts on the highest one. That’s because without it you’d get flapping — pods churning up and down with every wiggle in the metric, which is far worse than a few idle replicas.
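The window logic is easy to sketch in Python. This is a simplified model with invented timestamps and a hypothetical function name, assuming one recommendation per minute rather than one per 15-second sync.

```python
from typing import List, Tuple


def stabilized_replicas(history: List[Tuple[int, int]], now: int, window_seconds: int = 300) -> int:
    """Pick the highest recommendation made within the stabilization window.

    history holds (timestamp_seconds, desired_replicas) pairs, one per sync.
    """
    recent = [replicas for ts, replicas in history if 0 <= now - ts <= window_seconds]
    return max(recent)


# Load drops right after t=60: the raw formula wants 3 replicas from t=120 on.
history = [(0, 12), (60, 12), (120, 3), (180, 3), (240, 3), (300, 3), (360, 3), (420, 3)]

print(stabilized_replicas(history, now=180))  # 12
print(stabilized_replicas(history, now=360))  # 12
print(stabilized_replicas(history, now=420))  # 3
```

The replica count only drops once the last high recommendation has aged out of the window, which is why the pods appear to sit there for five minutes.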

Scale-up has its own perceived lag, too. Between the 15-second controller sync loop, the metrics-server scrape interval, image pulls, and readiness probes, a good 30–60 seconds can pass between “traffic arrived” and “new pod serving.” The HPA is reactive by nature; if you need capacity before the spike, that’s a job for scheduled scaling or something like KEDA, not a lower target.

The good news is that autoscaling/v2 gives you a behavior block to tune all of this per-HPA, and it’s massively useful. Here’s a finished copy of the kind of HPA I’d start from — you’re more than welcome to copy and paste it and adjust the numbers:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60

A quick walkthrough. First, scaleUp is aggressive on purpose: no stabilization window, and it can double the replica count or add 4 pods every 15 seconds, whichever is greater. Those values match the built-in scale-up defaults; writing them out just makes them visible and easy to tune. When you need capacity, you need it now. Then, scaleDown keeps the 5-minute window but swaps the default policy (which allows dropping up to 100% of your pods every 15 seconds) for a gentler bleed of 2 pods per minute. That way a brief lull in traffic doesn’t gut your capacity right before the next wave hits.
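To get a feel for what those policies allow, here is a simplified Python sketch. The function names are hypothetical, and it assumes the replica count at the start of each period is the baseline; the real controller tracks scale events inside the period and still clamps to minReplicas and maxReplicas.

```python
import math


def scale_up_ceiling(start_replicas: int, percent: int = 100, pods: int = 4) -> int:
    """Highest replica count reachable in one 15-second period (selectPolicy: Max)."""
    by_percent = math.ceil(start_replicas * (1 + percent / 100))
    by_pods = start_replicas + pods
    return max(by_percent, by_pods)


def scale_down_floor(start_replicas: int, pods: int = 2) -> int:
    """Lowest replica count reachable in one 60-second period."""
    return max(start_replicas - pods, 0)


print(scale_up_ceiling(2))    # 6  (the 4-pod policy wins on small deployments)
print(scale_up_ceiling(10))   # 20 (the 100% policy wins on larger ones)
print(scale_down_floor(12))   # 10
```

Pairing a Pods policy with a Percent policy is what keeps small deployments from crawling upward: doubling 2 replicas only gets you to 4, while the fixed step gets you to 6.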

Gotcha #6: Scaling on memory is usually a trap

CPU comes back down when load drops. Memory, for a ton of runtimes, does not. Garbage-collected languages like Java and Go grab heap and hold onto it, so a memory-based HPA happily scales you up during the spike and then never scales you back down — the pods still look full. On top of that, a chunk of most pods’ memory is fixed baseline overhead that doesn’t divide across replicas, so adding pods barely moves the average anyway.

And Gotcha #2 applies here in full force: a memory Utilization target needs a memory request on every container, or the HPA won’t scale at all. Teams tend to set CPU requests religiously for scheduling and skip memory requests entirely, which is probably the single most common way a memory-based HPA ends up sitting at <unknown> doing nothing.

And here’s the interaction that really stings: when an HPA has multiple metrics, it computes a desired replica count for each one and takes the maximum. Add a memory metric alongside your CPU metric and the memory one can pin you at maxReplicas indefinitely while CPU sits idle. If memory genuinely tracks per-request load in your app, fine — but for most services you’re better off with CPU or, better yet, a real demand signal like requests-per-second through the custom metrics API.
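Here is the "take the maximum" behavior as a Python sketch. The function name and the dictionary shape are assumptions made for the example, and the numbers are illustrative, not measurements.

```python
import math
from typing import Dict, Tuple


def desired_for_metrics(
    current_replicas: int,
    metrics: Dict[str, Tuple[float, float]],
    min_replicas: int,
    max_replicas: int,
) -> Tuple[int, Dict[str, int]]:
    """metrics maps a metric name to (current_value, target_value).

    Returns the final replica count plus each metric's own proposal.
    """
    proposals = {
        name: math.ceil(current_replicas * current / target)
        for name, (current, target) in metrics.items()
    }
    winner = max(proposals.values())
    clamped = max(min_replicas, min(max_replicas, winner))
    return clamped, proposals


final, proposals = desired_for_metrics(
    current_replicas=5,
    metrics={"cpu": (20, 70), "memory": (95, 80)},
    min_replicas=2,
    max_replicas=20,
)
print(proposals)  # {'cpu': 2, 'memory': 6}
print(final)      # 6
```

CPU alone would shrink this Deployment to 2 replicas, but memory asks for 6, and the larger number always wins. If memory never comes back down, neither does the replica count.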

When in doubt, kubectl describe hpa

If you take one debugging habit away from this article, make it this one. kubectl describe hpa <name> shows you three status conditions — AbleToScale, ScalingActive, and ScalingLimited — plus the event stream, and between them they’ll name-check nearly every gotcha above: missing metrics, missing requests, a maxed-out ceiling, a stabilization window holding things steady. I cannot overemphasize how many “why isn’t my HPA scaling” mysteries end about thirty seconds after someone finally runs it.

Take your HPA for a test drive

An HPA you’ve never watched scale is an HPA you’re trusting on faith. Before real traffic does the testing for you, spend ten minutes forcing a scale-up and a scale-down so you know the whole pipeline — metrics, math, and behavior — works the way you think it does.

First, get a window open on the action. In one terminal, watch the HPA itself:

kubectl get hpa web -w

The -w flag streams every change, so you’ll see the TARGETS column climb and the REPLICAS column react in real time. In a second terminal, keep an eye on the raw numbers with kubectl top pods — just remember from Gotcha #5 that those readings trail reality by a scrape interval or so.

Now, generate some load. The most realistic way is to push traffic through the Service, since that spreads load across all the pods the way production would:

kubectl run load-generator --rm -it --image=busybox:1.36 --restart=Never -- \
  /bin/sh -c "while true; do wget -q -O- http://web > /dev/null; done"

If one generator doesn’t move the needle, fire up two or three more with different names (load-generator-2, and so on). Within a minute or two you should see utilization cross your target, the replica count start climbing, and events landing in kubectl describe hpa:

Normal  SuccessfulRescale  horizontal-pod-autoscaler
New size: 6; reason: cpu resource utilization (percentage of request) above target

If you can’t easily route traffic at your workload, the quick-and-dirty alternative is to burn CPU inside a pod directly:

kubectl exec -it deploy/web -- sh -c 'yes > /dev/null'

Two things to know about that one. It’s a busy loop pegging a core in a single pod — and as we covered in Gotcha #3, the HPA scales on the average, so one hot pod among several idle ones may not budge the number at all. (Which, honestly, makes it a pretty great live demo of Gotcha #3.) And if your image is distroless there’s no shell to exec into, so you’re back to the load generator.

Scaling down is the half people forget to test. Kill the load — Ctrl+C the generators and the --rm flag cleans them up — and then wait. Remember Gotcha #5: the default 300-second stabilization window means nothing visible happens for five minutes, and that’s the system working, not hanging. Set a timer before you declare it broken. Once the window passes, the replica count should bleed down at whatever pace your scaleDown policies allow and settle at minReplicas.

If both directions behave — up under load, down without it — you’ve verified the loop end to end: metrics-server is feeding real numbers, requests are set where they need to be, and your behavior block is doing exactly what you tuned it to do.

Finally, if you’re feeling adventurous, the natural next step past resource metrics is scaling on what actually matters to your users — queue depth, RPS, latency — via the custom and external metrics APIs, with tools like KEDA or Prometheus Adapter doing the plumbing. That’s a topic for another article.

Additional Resources

If you want to go deeper, here’s where I’d point you. The official docs on this particular corner of Kubernetes are genuinely good:

  • Horizontal Pod Autoscaling — Kubernetes docs — the concepts page. The scaling algorithm, the tolerance band, stabilization windows, and the spec.replicas migration notes from Gotcha #4 all live here.
  • HPA Walkthrough — Kubernetes docs — the official hands-on tutorial, complete with a php-apache demo app and load generator. A ready-made playground if you’d rather test-drive against a sample app than your own workload.
  • HorizontalPodAutoscaler v2 API reference — every field in the metrics and behavior blocks, for when you’re tuning the manifest from Gotcha #5.
  • metrics-server — install manifests, the Helm chart, and the requirements list, including the kubelet certificate details behind the homelab caveat in Gotcha #1.
  • ArgoCD diffing customization — the docs for ignoreDifferences and friends, if Gotcha #4 hit close to home.
  • KEDA — event-driven autoscaling with a big catalog of scalers for queues, databases, cron schedules, and more. The natural upgrade path once resource metrics stop describing your actual load.
  • Prometheus Adapter — exposes your Prometheus metrics through the custom metrics API so the HPA can scale on them directly.

Enjoy!