Serving LLMs on a Budget: Notes From a Small Team
Small teams running language models in production face a different problem than frontier labs. The question isn’t “how do we push the state of the art” — it’s “how do we get predictable quality, latency, and cost out of a model we didn’t train.”
This post collects the patterns that held up across a handful of projects: picking a base model slightly smaller than you think you need, investing early in a repeatable evaluation loop, and treating the serving stack as a product surface rather than an afterthought.
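To make the "repeatable evaluation loop" concrete, here is a minimal sketch of what that can look like. The names (`run_model`, `CASES`) and the canned stub are illustrative assumptions, not anything from a real project; the point is a fixed prompt set with cheap deterministic checks that you can re-run on every model or prompt change.

```python
# Hypothetical minimal eval loop. `run_model` is a stand-in for your
# real serving client; CASES is a fixed, version-controlled prompt set.
CASES = [
    {"prompt": "Return the word OK.", "must_contain": "OK"},
    {"prompt": "What is 2 + 2?", "must_contain": "4"},
]

def run_model(prompt: str) -> str:
    # Stub with canned outputs so the sketch runs standalone;
    # swap in an API call to your model server here.
    canned = {
        "Return the word OK.": "OK",
        "What is 2 + 2?": "The answer is 4.",
    }
    return canned.get(prompt, "")

def evaluate(cases):
    results = []
    for case in cases:
        output = run_model(case["prompt"])
        results.append({
            "prompt": case["prompt"],
            "passed": case["must_contain"] in output,
        })
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results

rate, results = evaluate(CASES)
print(f"pass rate: {rate:.0%}")
```

Substring checks are crude, but a crude check you run on every change beats a sophisticated one you run twice a year.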
The short version: most of the wins come from boring decisions — sensible batching, a cache you actually trust, and the discipline to say no to features that would push you into a larger model tier.
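"A cache you actually trust" mostly comes down to key design. The sketch below is an assumed illustration (the `ResponseCache` class and its parameters are hypothetical, not from any library): normalize the prompt and fold every generation parameter into the key, so a temperature change can never serve a stale answer, and expire entries with a TTL.

```python
import hashlib
import time

class ResponseCache:
    """Hypothetical in-process response cache with TTL expiry."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expiry_timestamp, response)

    def _key(self, prompt: str, **params) -> str:
        # Normalize the prompt and include all generation params,
        # so identical requests hit and different params miss.
        canonical = prompt.strip() + "|" + repr(sorted(params.items()))
        return hashlib.sha256(canonical.encode()).hexdigest()

    def get(self, prompt: str, **params):
        key = self._key(prompt, **params)
        entry = self._store.get(key)
        if entry is None:
            return None
        expiry, response = entry
        if time.monotonic() > expiry:
            del self._store[key]  # expired: evict and report a miss
            return None
        return response

    def put(self, prompt: str, response: str, **params):
        key = self._key(prompt, **params)
        self._store[key] = (time.monotonic() + self.ttl, response)

cache = ResponseCache()
cache.put("What is 2+2?", "4", temperature=0.0)
print(cache.get("What is 2+2?  ", temperature=0.0))  # whitespace normalized: hit
print(cache.get("What is 2+2?", temperature=0.7))    # different params: miss
```

Hashing the full canonical string keeps keys fixed-size, and sorting the parameter items makes the key independent of keyword order.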