GPU Pools Without Kubernetes
The reflex when infrastructure starts to feel complicated is to reach for a scheduler. For small GPU fleets this is often the wrong move — it trades a tractable operational problem for an opaque control plane that now owns your uptime.
For workloads that are long-running inference rather than short batch jobs, a straightforward pool pattern — a job queue, a pool of worker boxes, a shared health check, and the ability to mark a box as draining so it finishes in-flight work but takes nothing new — carries you a lot further than most teams expect.
This post covers the version of that pattern I keep reaching for, and the specific failure modes that eventually push you toward something heavier.
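To make the shape concrete, here is a minimal sketch of that pool pattern. All names here (Worker, Pool, gpu-a, and so on) are illustrative assumptions, not part of any real system; it shows only the core selection logic — healthy, non-draining boxes take new work in round-robin order.

```python
import queue
from dataclasses import dataclass

@dataclass
class Worker:
    # Hypothetical worker record: one GPU box in the pool.
    name: str
    healthy: bool = True    # flipped by the shared health check
    draining: bool = False  # set when we want the box out of rotation

    def available(self) -> bool:
        # A box accepts new work only if it is healthy and not draining.
        return self.healthy and not self.draining

class Pool:
    def __init__(self, workers):
        self.workers = workers
        self.jobs = queue.Queue()  # the shared job queue
        self._rr = 0               # round-robin cursor

    def pick(self):
        # Round-robin over boxes that pass the availability check.
        candidates = [w for w in self.workers if w.available()]
        if not candidates:
            return None
        w = candidates[self._rr % len(candidates)]
        self._rr += 1
        return w

    def drain(self, name):
        # Draining: the box finishes in-flight work but gets nothing new.
        for w in self.workers:
            if w.name == name:
                w.draining = True

pool = Pool([Worker("gpu-a"), Worker("gpu-b"), Worker("gpu-c")])
pool.drain("gpu-b")
picked = {pool.pick().name for _ in range(6)}
# gpu-b is draining, so new work lands only on gpu-a and gpu-c
```

The point of keeping the pattern this small is that every state transition — a box failing its health check, a box entering drain — is a plain boolean you can inspect and flip by hand, with no control plane in between.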