Retrieval-augmented generation (RAG) has become the default architecture for grounding large language models in enterprise data. But the gap between a working prototype and a production-grade system that handles 10M+ documents reliably is vast.
This post covers the architectural decisions we make when building RAG systems for enterprise clients — from vector store selection to chunking strategies, re-ranking pipelines, and query routing.
Chunking strategy matters more than you think
Most RAG tutorials use fixed-size chunking (e.g., 512 tokens with 50-token overlap). In production, this approach breaks down because it ignores semantic boundaries: a fixed window can cut a sentence, table, or clause in half and separate a statement from the context that gives it meaning. We use semantic chunking combined with a document-aware splitter that respects headings, sections, and paragraph boundaries.
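As a rough illustration of the document-aware part, here is a minimal sketch of a splitter that never lets a chunk cross a heading and only breaks between paragraphs. It assumes headings are marked with a leading "#", paragraphs are separated by blank lines, and whitespace-separated words stand in for tokens; the function name and the max_words parameter are hypothetical.

```python
import re


def split_by_structure(text: str, max_words: int = 200) -> list[dict]:
    """Split on headings, then pack whole paragraphs into size-capped chunks."""
    chunks: list[dict] = []
    heading = ""
    buffer: list[str] = []
    count = 0

    def flush() -> None:
        # Emit the paragraphs collected so far as one chunk.
        nonlocal buffer, count
        if buffer:
            chunks.append({"heading": heading, "text": "\n\n".join(buffer)})
        buffer, count = [], 0

    for block in re.split(r"\n\s*\n", text.strip()):
        first, _, rest = block.strip().partition("\n")
        if first.startswith("#"):
            flush()  # never let a chunk span two sections
            heading = first.lstrip("#").strip()
            block = rest.strip()
        else:
            block = block.strip()
        if not block:
            continue
        words = len(block.split())
        if buffer and count + words > max_words:
            flush()  # break between paragraphs, not inside one
        buffer.append(block)  # an oversized paragraph is kept whole
        count += words
    flush()
    return chunks
```

Each chunk carries the heading it came from, which can be prepended to the text before embedding so the vector reflects the section as well as the paragraph.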
For structured documents like financial reports or legal contracts, hierarchical chunking works well — preserving parent-child relationships between sections and sub-sections allows more precise retrieval.
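One way to picture hierarchical chunking is as a list of section nodes, each holding a link to its parent. The sketch below is an assumption-laden illustration rather than a production parser: it expects "#"-style heading markers whose depth encodes the section level, and the Node, build_hierarchy, and breadcrumb names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    node_id: int
    parent_id: Optional[int]
    level: int
    title: str
    text: str = ""


def build_hierarchy(lines: list[str]) -> list[Node]:
    """Turn '#'-prefixed headings and body lines into nodes with parent links."""
    nodes: list[Node] = []
    stack: list[Node] = []  # currently open sections, outermost first
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            level = len(stripped) - len(stripped.lstrip("#"))
            # Close every section at the same or a deeper level.
            while stack and stack[-1].level >= level:
                stack.pop()
            parent_id = stack[-1].node_id if stack else None
            node = Node(len(nodes), parent_id, level, stripped.lstrip("#").strip())
            nodes.append(node)
            stack.append(node)
        elif stack:
            # Body text belongs to the innermost open section.
            # Text that appears before the first heading is ignored here.
            stack[-1].text = (stack[-1].text + " " + stripped).strip()
    return nodes


def breadcrumb(nodes: list[Node], node_id: int) -> str:
    """Walk parent links to recover the section path of a retrieved chunk."""
    parts: list[str] = []
    current: Optional[int] = node_id
    while current is not None:
        parts.append(nodes[current].title)
        current = nodes[current].parent_id
    return " > ".join(reversed(parts))
```

At query time, a hit on a leaf node can be returned together with its breadcrumb, or expanded to the parent section when the leaf alone is too narrow to answer the question.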
Vector store selection at scale
For corpora under 1M documents, Pinecone or Weaviate works well with default configurations. Above that, you need to think about index partitioning, approximate nearest neighbor (ANN) algorithm selection, and query latency under concurrent load.
We typically use pgvector for clients who need tight PostgreSQL integration, and Pinecone serverless for pure vector search workloads. The key is to benchmark retrieval latency under your actual query distribution rather than rely on synthetic benchmarks.
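A benchmark of this kind can be as simple as replaying logged queries against the store and looking at tail latency. The sketch below assumes a hypothetical search callable that wraps whichever vector store client is under test; the concurrency default of 8 is arbitrary and should be set to match expected load.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def timed_call(search: Callable[[str], object], query: str) -> float:
    """Run one query and return its wall-clock latency in milliseconds."""
    start = time.perf_counter()
    search(query)
    return (time.perf_counter() - start) * 1000.0


def benchmark(
    search: Callable[[str], object],
    queries: Sequence[str],
    concurrency: int = 8,
) -> dict[str, float]:
    """Replay queries with `concurrency` workers and summarize latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda q: timed_call(search, q), queries))
    # "inclusive" interpolates within the observed samples, so no percentile
    # is extrapolated past the slowest query. Needs at least two samples.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98], "max": max(latencies)}
```

Feeding it a sample of real user queries, in their real proportions, is what makes the numbers meaningful; the p95 and p99 figures usually say more about production behavior than the median does.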
Re-ranking is not optional
First-pass retrieval using vector similarity is fast but imprecise, because the query and each document are embedded independently. A cross-encoder re-ranker (e.g., Cohere's reranker or a fine-tuned BERT model) scores each query-passage pair jointly, and adding one to the retrieval pipeline measurably improves answer quality. We've seen accuracy improvements of 15-30% on enterprise Q&A tasks.
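The two-stage shape of such a pipeline can be sketched independently of any particular model. In the example below, both scorers are toy stand-ins chosen so the code runs without external dependencies: overlap plays the role of the vector search and phrase_match plays the role of the cross-encoder. All names and the candidates and top_k defaults are hypothetical.

```python
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]


def retrieve_and_rerank(
    query: str,
    corpus: Sequence[str],
    first_pass: Scorer,
    reranker: Scorer,
    candidates: int = 20,
    top_k: int = 3,
) -> list[str]:
    """Cheap scoring over everything, expensive scoring over a shortlist."""
    # Stage 1: fast, approximate score across the whole corpus.
    shortlist = sorted(corpus, key=lambda d: first_pass(query, d), reverse=True)
    shortlist = shortlist[:candidates]
    # Stage 2: slower, more precise score on the shortlist only.
    reranked = sorted(shortlist, key=lambda d: reranker(query, d), reverse=True)
    return reranked[:top_k]


def overlap(query: str, doc: str) -> float:
    """Toy first-pass scorer: number of words shared by query and document."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_match(query: str, doc: str) -> float:
    """Toy re-ranker: rewards documents containing the exact query phrase."""
    return 1.0 if query.lower() in doc.lower() else 0.0
```

The design point is the asymmetry: the first pass must be cheap enough to run over millions of documents, while the re-ranker only ever sees the shortlist, so its per-pair cost is bounded by the candidates setting.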
Query routing and fallback strategies
Complex enterprise deployments often combine multiple data sources: internal documents, databases, APIs, and web content. A query router that classifies user intent and directs queries to the appropriate retrieval backend dramatically improves response relevance.
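To make the routing idea concrete, here is a minimal sketch in which a keyword matcher stands in for the intent classifier. A real deployment would more likely use a trained classifier or an LLM call; the route names, cue phrases, and function names below are all hypothetical.

```python
from typing import Callable

Backend = Callable[[str], str]

# Hypothetical cue phrases per backend; anything unmatched goes to documents.
ROUTES: dict[str, tuple[str, ...]] = {
    "database": ("how many", "total", "average", "count"),
    "api": ("status", "current", "latest"),
    "web": ("news", "competitor", "public"),
}


def classify(query: str) -> str:
    """Return the name of the backend whose cues first match the query."""
    lowered = query.lower()
    for route_name, cues in ROUTES.items():
        if any(cue in lowered for cue in cues):
            return route_name
    return "documents"


def route(query: str, backends: dict[str, Backend]) -> str:
    """Send the query to the backend chosen by the classifier."""
    return backends[classify(query)](query)
```

Substring matching like this is brittle (the cue "count" also matches "account"), which is one reason production routers tend to replace the classify step with a learned model while keeping the same dispatch structure.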
Always implement a fallback: if the RAG system returns low-confidence results, route the query to a human escalation path or to a more general LLM response rather than letting the model confabulate an answer.
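A fallback of this kind can be sketched as a threshold on retrieval scores. The example below assumes a hypothetical retrieve function that returns (passage, score) pairs with scores in a comparable range; the 0.6 cutoff is a placeholder that would need calibrating against labeled queries, and all names are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Answer:
    text: str
    source: str  # "rag", "general_llm", or "human"


def answer_with_fallback(
    query: str,
    retrieve: Callable[[str], Sequence[tuple[str, float]]],
    generate: Callable[[str, list[str]], str],
    general_llm: Optional[Callable[[str], str]] = None,
    min_score: float = 0.6,  # placeholder threshold, calibrate on real data
) -> Answer:
    """Answer from retrieved context only when retrieval looks trustworthy."""
    hits = retrieve(query)
    confident = [passage for passage, score in hits if score >= min_score]
    if confident:
        return Answer(generate(query, confident), "rag")
    if general_llm is not None:
        # No grounded context: fall back to a general, ungrounded response.
        return Answer(general_llm(query), "general_llm")
    # Nothing reliable available: hand off instead of guessing.
    return Answer("No reliable answer found; escalating to a human.", "human")
```

Returning the source alongside the text lets the calling application label ungrounded answers differently in the UI and log how often each path is taken.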