
05 Apr 2026

Building a RAG Pipeline from Scratch: What I Learned Shipping It to Production

RAG · LangChain · AI · Python · Production · Generative AI

Retrieval-Augmented Generation looks straightforward on paper. You split documents into chunks, embed them into a vector store, retrieve the most relevant ones at query time, and pass them to an LLM. The tutorials make it look like an afternoon project.
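That naive loop fits in a few dozen lines. The sketch below is purely illustrative: `embed` uses word counts as a stand-in for a real embedding model, and the final prompt is only assembled, never sent to an LLM.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in embedding: bag-of-words counts. A real pipeline would
    # call an embedding model here and get back a dense vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count "vectors".
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chunk(doc: str, size: int = 50) -> list[str]:
    # Fixed-size chunking by word count -- the naive approach.
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank every chunk against the query and keep the top k.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

# Index one toy document, retrieve for a query, build the prompt.
doc = "Section 12 covers liability limits. Section 14 covers arbitration clauses."
chunks = chunk(doc, size=5)
context = retrieve("What are the liability limits?", chunks, k=1)
prompt = f"Answer using only this context:\n{context[0]}\n\nQ: What are the liability limits?"
```

In practice each of these four functions is a library call, but the shape of the data flow is exactly this.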

Then you try to ship it to real users and everything you thought you understood gets stress-tested.


The Chunking Problem Nobody Talks About

The first thing that broke in production was chunking. Fixed-size chunking — splitting every 500 tokens — destroys context. A sentence that begins in one chunk and ends in the next becomes meaningless to the retriever. Switching to semantic chunking, where splits happen at natural boundaries like paragraph breaks and heading changes, improved retrieval relevance significantly.
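A minimal version of that boundary-aware splitter, assuming plain-text documents with blank-line paragraph breaks and markdown-style headings (the `max_chars` packing threshold is an illustrative choice, not a recommendation):

```python
import re

def semantic_chunks(text: str, max_chars: int = 500) -> list[str]:
    # Split at natural boundaries: blank lines (paragraph breaks) and
    # lines starting a markdown heading. Then pack whole paragraphs
    # into chunks so no sentence is ever cut in half.
    paragraphs = re.split(r"\n\s*\n|\n(?=#)", text.strip())
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)   # close the current chunk at a boundary
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

The key property is that chunk edges always coincide with paragraph or heading edges, so the retriever never sees half a sentence.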


Embedding Model Choice Matters More Than You Think

I started with a general-purpose embedding model and retrieval quality was mediocre. The queries were about legal documents and the model had no domain awareness. Switching to a model fine-tuned on legal text made an immediate, measurable difference. The lesson: your embedding model needs to understand your domain, not just language in general.
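Swapping models is painless only if the embedding model is a config value rather than a call site scattered through the pipeline. One hypothetical way to arrange that, with toy scoring functions standing in for real models:

```python
from typing import Callable

# Hypothetical registry mapping model names to embedding callables,
# so changing models is a one-line config change.
EMBEDDERS: dict[str, Callable[[str], list[float]]] = {}

def register(name: str):
    def wrap(fn):
        EMBEDDERS[name] = fn
        return fn
    return wrap

@register("general")
def general_embed(text: str) -> list[float]:
    # Stand-in for a general-purpose embedding model.
    return [float(len(text)), float(text.count(" "))]

@register("legal")
def legal_embed(text: str) -> list[float]:
    # Stand-in for a model fine-tuned on legal text: it "knows"
    # domain vocabulary that a general model treats as noise.
    legal_terms = {"statute", "liability", "clause", "tort"}
    hits = sum(w.strip(".,") in legal_terms for w in text.lower().split())
    return [float(hits), float(len(text))]

def embed(text: str, model: str = "legal") -> list[float]:
    return EMBEDDERS[model](text)
```

The registry itself is trivial; the point is that re-benchmarking a new embedding model should never require touching retrieval code.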


The Retrieval-Generation Gap

Retrieving the right chunks is only half the problem. The LLM still needs to synthesize them coherently. I found that retrieved chunks often contradicted each other — different documents making different claims about the same statute. Adding a reranking step before generation, and explicitly prompting the model to acknowledge conflicting information, made responses significantly more trustworthy.
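The retrieve-then-rerank shape looks like this. Both scorers below are stand-ins (word overlap for vector similarity, a keyword bonus for a cross-encoder reranker); the two-stage structure and the conflict-aware prompt are the point:

```python
def cheap_score(query: str, chunk: str) -> float:
    # First pass: word overlap, standing in for vector similarity.
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def rerank_score(query: str, chunk: str) -> float:
    # Second pass: stand-in for a cross-encoder reranker, which reads
    # query and chunk together instead of comparing cached vectors.
    return cheap_score(query, chunk) + 0.5 * ("statute" in chunk.lower())

def retrieve_and_rerank(query: str, chunks: list[str],
                        fetch_k: int = 10, k: int = 3) -> list[str]:
    # Fetch a wide candidate set cheaply, then rerank only those.
    candidates = sorted(chunks, key=lambda c: cheap_score(query, c),
                        reverse=True)[:fetch_k]
    return sorted(candidates, key=lambda c: rerank_score(query, c),
                  reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    joined = "\n---\n".join(context)
    return ("Answer from the context below. If sources conflict, say so "
            "explicitly rather than picking one silently.\n\n"
            f"Context:\n{joined}\n\nQuestion: {query}")
```

Reranking only the `fetch_k` candidates keeps the expensive scorer off the hot path for the full corpus.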


What Actually Works in Production

After several iterations the stack that worked was: semantic chunking with overlap, domain-appropriate embeddings, cosine similarity retrieval with a reranker, and a system prompt that instructs the model to cite its sources and flag uncertainty. It is slower than a naive implementation but the output quality justifies it.
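The generation-side rules in that stack live almost entirely in the system prompt and the context formatting. A hypothetical version of both (the exact wording and the `[doc-N]` id scheme are illustrative, not the prompt from production):

```python
# Hypothetical system prompt encoding the two generation-side rules:
# cite sources, and flag uncertainty instead of guessing.
SYSTEM_PROMPT = """\
You answer questions using only the provided context chunks.
Rules:
1. Cite the source id of every chunk you rely on, e.g. [doc-3].
2. If the context does not answer the question, say so; do not guess.
3. If chunks conflict, present both claims and note the conflict.
"""

def format_context(chunks: list[tuple[str, str]]) -> str:
    # chunks are (source_id, text) pairs, rendered with visible ids
    # so the model has something concrete to cite.
    return "\n\n".join(f"[{sid}] {text}" for sid, text in chunks)
```

Putting the ids inline in the context is what makes rule 1 enforceable: the model can only cite what it can see.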

RAG is not a feature you add — it is a system you design. The gap between a demo and a production deployment is where most of the real engineering lives.