
Serving LLMs on a Budget: Notes From a Small Team

By Arsalan Mosenia · Published

Small teams running language models in production face a different problem than frontier labs. The question isn’t “how do we push the state of the art” — it’s “how do we get predictable quality, latency, and cost out of a model we didn’t train.”

This post collects the patterns that held up across a handful of projects: picking a base model slightly smaller than you think you need, investing early in a repeatable evaluation loop, and treating the serving stack as a product surface rather than an afterthought.
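To make "repeatable evaluation loop" concrete, here is a minimal sketch of the shape such a loop can take: a fixed set of prompts with simple pass/fail checks, run the same way every time. All names here (`EVAL_SET`, `call_model`, `run_eval`) are illustrative, not from any particular library; `call_model` is a stub you would replace with your real inference client.

```python
# Minimal sketch of a repeatable evaluation loop.
# EVAL_SET, call_model, and run_eval are hypothetical names for illustration.
import json

# A fixed, versioned set of prompts with cheap deterministic checks.
EVAL_SET = [
    {"prompt": "Summarize: the cat sat on the mat.", "must_contain": "cat"},
    {"prompt": "Translate to French: hello", "must_contain": "bonjour"},
]

def call_model(prompt: str) -> str:
    # Placeholder: swap in your real serving-stack client here.
    return "bonjour, the cat sat"

def run_eval(cases):
    # Run every case, record pass/fail, and report an aggregate pass rate.
    results = []
    for case in cases:
        output = call_model(case["prompt"]).lower()
        results.append({
            "prompt": case["prompt"],
            "passed": case["must_contain"] in output,
        })
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results

if __name__ == "__main__":
    rate, _ = run_eval(EVAL_SET)
    print(json.dumps({"pass_rate": rate}))
```

The point is less the scoring logic than the discipline: the same prompts, the same checks, run before every model or config change.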

The short version: most of the wins come from boring decisions — sensible batching, a cache you actually trust, and the discipline to say no to features that would push you into a larger model tier.
