Arsalan Mosenia
RAG · LLMs / 9 min read

Retrieval Is Mostly Data Work

By Arsalan Mosenia · Published

A common failure mode for retrieval projects is to spend the first month comparing embedding models and the next three months discovering that the corpus has duplicate documents, inconsistent chunking, and missing metadata.

The embedding choice rarely dominates. What dominates is whether the text you are embedding is the text a user would actually want to find, chunked at a granularity that matches how they ask.

This post walks through the pre-retrieval pipeline that has consistently improved retrieval quality: source normalization, structural chunking, per-chunk context carrying, and query rewriting as a first-class stage rather than a prompt afterthought.
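To make the first three stages concrete, here is a minimal sketch, not a reference implementation. It assumes Markdown-style `#` headings mark document structure and that exact-duplicate detection after normalization is enough. The names `normalize`, `dedupe`, `Chunk`, and `chunk_by_heading` are hypothetical, chosen for illustration. Query rewriting is left out because it usually depends on a model call.

```python
import hashlib
import re
import unicodedata
from dataclasses import dataclass


def normalize(text: str) -> str:
    """Source normalization: unify Unicode forms and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    out: list[str] = []
    for line in lines:
        # Keep a blank line only if the previous kept line was not blank.
        if line or (out and out[-1]):
            out.append(line)
    return "\n".join(out).strip()


def dedupe(docs: list[str]) -> list[str]:
    """Drop documents that are identical after normalization and lowercasing."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        key = hashlib.sha256(normalize(doc).lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


@dataclass
class Chunk:
    title: str    # document title, carried into every chunk
    heading: str  # nearest section heading above the body
    body: str

    def embedding_text(self) -> str:
        """Text to embed: the body prefixed with its structural context."""
        return f"{self.title} > {self.heading}\n{self.body}"


def chunk_by_heading(title: str, text: str) -> list[Chunk]:
    """Structural chunking: one chunk per heading-delimited section."""
    chunks: list[Chunk] = []
    heading = "(intro)"  # assumed label for text before the first heading
    buffer: list[str] = []

    def flush() -> None:
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(Chunk(title, heading, body))

    for line in normalize(text).splitlines():
        if line.startswith("#"):
            flush()
            heading, buffer = line.lstrip("#").strip(), []
        else:
            buffer.append(line)
    flush()
    return chunks


docs = [
    "# Setup\nInstall  the tool.\n\n# Usage\nRun it.",
    "# Setup\nInstall the tool.\n\n# Usage\nRun it.",  # whitespace-only variant
]
unique_docs = dedupe(docs)
chunks = chunk_by_heading("Guide", unique_docs[0])
```

Embedding `chunk.embedding_text()` rather than the bare body is one simple way to carry context per chunk: a short section such as "Run it." stays findable because the document title and heading travel with it. Near-duplicate detection, which exact hashing misses, would need a fuzzier fingerprint than the one assumed here.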
