Retrieval Is Mostly Data Work
A common failure mode for retrieval projects is to spend the first month comparing embedding models and the next three months discovering that the corpus has duplicate documents, inconsistent chunking, and missing metadata.
The embedding choice rarely dominates. What dominates is whether the text you are embedding is the text a user would actually want to find, chunked at a granularity that matches how they ask.
This post walks through the pre-retrieval pipeline that has consistently moved the needle on retrieval quality: source normalization, structural chunking, per-chunk context carrying, and query rewriting as a first-class stage rather than a prompt afterthought.
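To make the first three stages concrete, here is a minimal sketch, not the pipeline itself. It assumes documents arrive as (title, body) pairs and that section headings are lines starting with "# ". Both are illustrative assumptions, as are all the names (`normalize`, `dedupe`, `chunk_by_heading`, `Chunk`). Query rewriting is left out because it usually depends on a model call.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_title: str
    heading: str
    text: str

    def embed_text(self) -> str:
        # Context carrying: prefix the chunk with its document title and
        # section heading so the embedded text is not stripped of context.
        return f"{self.doc_title} > {self.heading}\n{self.text}"


def normalize(text: str) -> str:
    # Source normalization (hypothetical, minimal): collapse whitespace
    # and lowercase so trivially different copies compare equal.
    return re.sub(r"\s+", " ", text).strip().lower()


def dedupe(docs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    # Keep the first document for each distinct normalized body.
    seen: set[str] = set()
    unique = []
    for title, body in docs:
        digest = hashlib.sha256(normalize(body).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((title, body))
    return unique


def chunk_by_heading(title: str, body: str) -> list[Chunk]:
    # Structural chunking: split on heading lines rather than on a fixed
    # character count. Assumes headings look like "# Heading".
    chunks: list[Chunk] = []
    heading = "(intro)"
    buf: list[str] = []

    def flush() -> None:
        # Emit the buffered section only if it has non-blank content.
        if any(line.strip() for line in buf):
            chunks.append(Chunk(title, heading, "\n".join(buf).strip()))

    for line in body.splitlines():
        if line.startswith("# "):
            flush()
            heading = line[2:].strip()
            buf = []
        else:
            buf.append(line)
    flush()
    return chunks
```

A typical use would be to run `dedupe` over the raw corpus, pass each surviving document through `chunk_by_heading`, and embed `chunk.embed_text()` rather than `chunk.text`. Exact-hash deduplication only catches near-identical copies; fuzzier matching would be a separate design choice.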