Introduction
When I first started building RAG (Retrieval-Augmented Generation) systems, the tutorials made it look straightforward: chunk your documents, embed them, store in a vector database, retrieve, and generate. In practice, every one of those steps hides a dozen decisions that make or break the user experience.
Over the past two years, I’ve deployed RAG systems for university learning platforms, developer support communities, legal document analysis, and enterprise knowledge bases. Here’s what I’ve learned about building systems that actually work.
Chunking Strategies That Matter
The single most impactful decision in a RAG system is how you chunk your documents. Get it wrong, and no amount of clever retrieval will save you.
Fixed-Size vs. Semantic Chunking
Fixed-size chunking (e.g., 500 tokens with 50-token overlap) is the default in most tutorials. It’s simple, predictable, and often good enough for homogeneous content like documentation.
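As a sketch, fixed-size chunking with overlap is only a few lines. Here the input is treated as an already-tokenized list; a real pipeline should count tokens with the same tokenizer your embedding model uses:

```python
def chunk_fixed(tokens, size=500, overlap=50):
    """Split a token sequence into fixed-size chunks with overlap.

    `tokens` can be any list (words, tokenizer IDs); in production,
    tokenize with the embedder's own tokenizer so counts line up.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

Each chunk shares its last `overlap` tokens with the start of the next, which is exactly where boundary context would otherwise be lost.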
But for real-world content – legal contracts, academic papers, mixed-format documents – semantic chunking dramatically improves retrieval quality. I use a combination approach:
- Heading-based splitting for structured documents
- Sentence-boundary chunking with semantic similarity thresholds
- Metadata-enriched chunks that carry their parent document context
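To make the first and third ideas concrete, here is a minimal heading-based splitter for markdown-style documents that attaches parent-document metadata to every chunk. The regex and the dict shape are my own illustration, not a library API:

```python
import re

def split_by_headings(markdown_text, doc_title):
    """Split markdown on headings; each chunk carries its parent context."""
    chunks = []
    current_heading = None
    buf = []

    def flush():
        text = "\n".join(buf).strip()
        if text:
            chunks.append({"doc": doc_title,        # parent document context
                           "heading": current_heading,
                           "text": text})

    for line in markdown_text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:                      # a new section starts: emit the previous one
            flush()
            current_heading = m.group(2)
            buf = []
        else:
            buf.append(line)
    flush()                        # don't drop the final section
    return chunks
```

Because each chunk carries `doc` and `heading`, the generator can cite where an answer came from, and retrieval filters can scope queries to a document or section.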
The Overlap Problem
Too little overlap and you lose context at chunk boundaries. Too much and you waste tokens on redundant content. I’ve found 10-15% overlap works well for most use cases, but the real solution is hierarchical retrieval – store both large context windows and fine-grained chunks.
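A sketch of the hierarchical idea, again assuming token lists for simplicity: index small child chunks for precise matching, but keep a mapping back to the larger parent window that actually gets sent to the LLM:

```python
def build_hierarchy(doc_tokens, parent_size=2000, child_size=400):
    """Index fine-grained child chunks, each mapped to its parent window."""
    parents = []          # large windows used as generation context
    children = []         # small chunks used for embedding and matching
    child_to_parent = {}  # child index -> parent index

    for p_start in range(0, len(doc_tokens), parent_size):
        parent = doc_tokens[p_start:p_start + parent_size]
        parent_id = len(parents)
        parents.append(parent)
        for c_start in range(0, len(parent), child_size):
            child_to_parent[len(children)] = parent_id
            children.append(parent[c_start:c_start + child_size])

    return parents, children, child_to_parent
```

At query time you embed and match against `children`, then look up `child_to_parent` and return the parent window, getting precise retrieval and rich context at once.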
Embedding Model Selection
Not all embedding models are created equal. After extensive testing across multiple projects:
- For English-only content: OpenAI’s text-embedding-3-large remains hard to beat
- For multilingual content: Cohere’s embed-multilingual-v3 handles cross-language retrieval remarkably well
- For cost-sensitive deployments: open-source models like bge-large-en-v1.5 on Google Cloud Run offer excellent price-performance
The key insight: always benchmark on YOUR data. Generic benchmarks (MTEB, etc.) don’t predict performance on domain-specific content.
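A minimal way to benchmark on your own data is a labeled query set and a recall@k check. The data shapes below are assumptions for illustration; the metric itself is standard:

```python
def recall_at_k(results_by_query, relevant_by_query, k=10):
    """Fraction of queries with at least one relevant doc in the top-k results.

    results_by_query: {query: [doc_id, ...]} ranked retrieval output
    relevant_by_query: {query: {doc_id, ...}} human-labeled relevant docs
    """
    if not relevant_by_query:
        return 0.0
    hits = 0
    for query, relevant in relevant_by_query.items():
        top_k = results_by_query.get(query, [])[:k]
        if any(doc_id in relevant for doc_id in top_k):
            hits += 1
    return hits / len(relevant_by_query)
```

Run this per embedding model (and per chunking strategy) on a few hundred real user queries; the rankings it produces on your data routinely disagree with MTEB.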
Retrieval Techniques
Basic cosine similarity search gets you 60% of the way. Here’s what gets you the rest:
Hybrid Search
Combine dense (vector) and sparse (BM25/keyword) retrieval. Some queries need exact keyword matching that embeddings miss – product codes, error messages, specific names.
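One simple way to combine the two result lists is reciprocal rank fusion (RRF), which needs only ranks, not the incompatible scores of the two retrievers. This is a generic sketch; k=60 is the conventional constant:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked result lists (e.g., dense + BM25) by reciprocal rank.

    rankings: list of ranked doc-id lists, best first.
    Returns a single fused ranking, best first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Documents ranked highly in any list accumulate the most score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF ignores raw scores, it sidesteps the problem that cosine similarities and BM25 scores live on entirely different scales.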
Re-ranking
A cross-encoder re-ranker on your top-20 results dramatically improves precision. Cohere’s Rerank API is the easiest to integrate; for self-hosted deployments, cross-encoder/ms-marco-MiniLM-L-12-v2 works well.
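The re-rank step itself is trivial once you have a pairwise scorer. In the sketch below, `score_fn` stands in for a cross-encoder’s predict call (e.g., a sentence-transformers CrossEncoder); the toy word-overlap scorer is purely for illustration:

```python
def rerank(query, candidates, score_fn, keep=5):
    """Re-score retrieved candidates with a stronger pairwise model."""
    # Stable sort: ties keep their original (first-stage) retrieval order.
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:keep]

def toy_overlap_score(query, doc):
    """Stand-in scorer (word overlap). Swap in a real cross-encoder in production."""
    return len(set(query.lower().split()) & set(doc.lower().split()))
```

The pattern matters more than the scorer: retrieve generously (top-20 or more) with the cheap first stage, then let the expensive pairwise model pick the handful of chunks that reach the prompt.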
Query Transformation
Before retrieval, expand or rephrase the user’s query. A simple LLM call that generates 2-3 alternative phrasings, followed by retrieval for all of them, catches queries that would otherwise miss relevant content.
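A sketch of multi-query retrieval, where `rephrase_fn` would be the LLM call and `retrieve_fn` your vector search; both are injected here as stubs, since the wiring is what matters:

```python
def multi_query_retrieve(query, rephrase_fn, retrieve_fn, top_k=5):
    """Retrieve for the original query plus generated rephrasings; dedupe.

    rephrase_fn: query -> list of alternative phrasings (an LLM call in practice)
    retrieve_fn: (query, top_k) -> ranked list of doc ids (vector search in practice)
    """
    queries = [query] + rephrase_fn(query)
    seen, merged = set(), []
    for q in queries:
        for doc_id in retrieve_fn(q, top_k):
            if doc_id not in seen:   # keep first occurrence, preserve order
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```

The merged list can then be handed to the re-ranker, which resolves the ordering across the different phrasings.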
Conclusion
Building RAG systems that work in production requires attention to detail at every step of the pipeline. The difference between a demo and a production system is chunking strategy, retrieval sophistication, and relentless evaluation on real user queries. Start simple, measure everything, and iterate based on actual failure cases.