GPU Pools Without Kubernetes
The reflex when infrastructure starts to feel complicated is to reach for a scheduler. For small GPU fleets this is often the wrong move — it trades a tractable operational problem for an opaque control plane that now owns your uptime.
For workloads that are long-running inference rather than short batch jobs, a straightforward pool pattern — a job queue, a pool of worker boxes, a shared health check, and the ability to mark a box as draining so it finishes in-flight work but takes nothing new — carries you a lot further than most teams expect.
This post covers the version of that pattern I keep reaching for, and the specific failure modes that eventually push you toward something heavier.
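To make the shape concrete, here is a minimal sketch of that pool pattern. All names here (Worker, Pool, gpu-a, and so on) are illustrative assumptions, not part of any real system; it shows only the core selection logic — healthy, non-draining boxes take new work in round-robin order.

```python
import queue
from dataclasses import dataclass

@dataclass
class Worker:
    # Hypothetical worker record: one GPU box in the pool.
    name: str
    healthy: bool = True    # flipped by the shared health check
    draining: bool = False  # set when we want the box out of rotation

    def available(self) -> bool:
        # A box accepts new work only if it is healthy and not draining.
        return self.healthy and not self.draining

class Pool:
    def __init__(self, workers):
        self.workers = workers
        self.jobs = queue.Queue()  # the shared job queue
        self._rr = 0               # round-robin cursor

    def pick(self):
        # Round-robin over boxes that pass the availability check.
        candidates = [w for w in self.workers if w.available()]
        if not candidates:
            return None
        w = candidates[self._rr % len(candidates)]
        self._rr += 1
        return w

    def drain(self, name):
        # Draining: the box finishes in-flight work but gets nothing new.
        for w in self.workers:
            if w.name == name:
                w.draining = True

pool = Pool([Worker("gpu-a"), Worker("gpu-b"), Worker("gpu-c")])
pool.drain("gpu-b")
picked = {pool.pick().name for _ in range(6)}
# gpu-b is draining, so new work lands only on gpu-a and gpu-c
```

The point of keeping the pattern this small is that every state transition — a box failing its health check, a box entering drain — is a plain boolean you can inspect and flip by hand, with no control plane in between.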