
RAG Pipeline Architecture

Give your LLM access to your data without fine-tuning. RAG bridges the gap between general-purpose language models and domain-specific knowledge.

May 2, 2026
Category: AI / Data
Complexity: Advanced
Industries: Legal, Healthcare

When You Need This

You want to build an AI assistant that answers questions about your organization's documents — contracts, policies, knowledge bases, product documentation, medical records. Fine-tuning an LLM on your data is expensive, slow, and creates a model that's frozen at the point of training. You need an architecture where the LLM can access up-to-date, domain-specific information at query time, cite its sources, and avoid hallucinating facts that aren't in your documents. RAG (Retrieval-Augmented Generation) is how you get there.

Pattern Overview

RAG augments LLM generation with retrieved context from a knowledge base. At query time, the system converts the user's question into an embedding, searches a vector database for semantically similar document chunks, and includes the most relevant chunks as context in the LLM prompt. This grounds the model's response in actual documents, enables source citation, and keeps the knowledge base updatable without retraining. A production RAG pipeline handles ingestion (parsing, chunking, embedding), retrieval (vector search, reranking, hybrid search), and generation (prompt construction, streaming, guardrails).

Reference Architecture

The architecture has two pipelines. The ingestion pipeline processes documents through parsing (PDF, DOCX, HTML extraction), chunking (semantic or fixed-size with overlap), embedding (via embedding model), and storage (vector database + document store). The query pipeline takes a user question, generates a query embedding, retrieves candidate chunks from the vector database, reranks them for relevance, constructs a prompt with the top chunks as context, and streams the LLM response with source citations.
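To make the query pipeline concrete, here is a minimal sketch of its control flow. The `embed`, `vector_store`, `reranker`, and `llm` callables are illustrative stand-ins rather than any particular SDK, and a production system would stream the response instead of returning it whole:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str

def answer(question: str, embed, vector_store, reranker, llm, k: int = 5) -> str:
    """Query pipeline: embed -> retrieve -> rerank -> prompt -> generate."""
    # 1. Embed the question into the same vector space as the stored chunks.
    query_vec = embed(question)

    # 2. Fast ANN search returns a generous candidate set.
    candidates = vector_store.search(query_vec, top_k=50)

    # 3. Reranking narrows the candidates to the k most relevant chunks.
    top_chunks = reranker(question, candidates)[:k]

    # 4. Build a grounded prompt with numbered sources the model can cite.
    context = "\n\n".join(
        f"[{i + 1}] ({c.doc_id}) {c.text}" for i, c in enumerate(top_chunks)
    )
    prompt = (
        "Answer using ONLY the sources below, citing them as [n]. "
        "If the answer is not in the sources, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

    # 5. Generate. The instruction to admit missing answers is a cheap
    #    guardrail against hallucinating beyond the retrieved context.
    return llm(prompt)
```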

Core Components
  • Document Ingestion Pipeline: Multi-format parser (Apache Tika, Unstructured, or custom) that extracts text from PDFs, DOCX, HTML, Markdown, and scanned images (OCR). Chunking strategy splits documents into retrievable units — MW defaults to semantic chunking (split at paragraph/section boundaries) with 512-token target size and 50-token overlap
  • Embedding Service: Converts text chunks into vector embeddings. Uses models like OpenAI text-embedding-3-large, Cohere embed-v4, or open-source alternatives (BGE, E5). Batch processing for ingestion, single-query processing for search (see the batching sketch after this list)
  • Vector Database: Stores embeddings with metadata for filtered search. Supports approximate nearest neighbor (ANN) search at scale. See Scalable Vector Database Architecture for production-scale considerations
  • Retrieval & Reranking: Two-stage retrieval — fast ANN search returns top-50 candidates, then a cross-encoder reranker (Cohere Rerank, BGE Reranker, or ColBERT) scores each candidate against the query for precise relevance ranking. Top-5 chunks go to the LLM
  • Hybrid Search: Combines vector (semantic) search with keyword (BM25) search. This catches cases where vector search misses exact terminology (product codes, legal clauses, medical terms) that keyword search handles well. Reciprocal rank fusion merges the two result sets
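
To make the batching split in the Embedding Service bullet concrete, here is a sketch of the two paths. The `embed_api` callable and the batch size are assumptions; real providers impose their own caps on texts and tokens per request:

```python
from typing import Callable

def embed_corpus(
    texts: list[str],
    embed_api: Callable[[list[str]], list[list[float]]],
    batch_size: int = 128,
) -> list[list[float]]:
    """Ingestion path: embed chunks in batches to amortize per-request overhead."""
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_api(texts[start:start + batch_size]))
    return vectors

def embed_query(text: str, embed_api: Callable[[list[str]], list[list[float]]]) -> list[float]:
    """Query path: one low-latency call per user question."""
    return embed_api([text])[0]
```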

Design Decisions & Trade-offs

Chunking Strategy: Fixed-Size vs. Semantic vs. Document-Structure
Fixed-size chunking (split every N tokens) is simple but breaks mid-sentence and loses document structure. Semantic chunking (split at natural boundaries — paragraphs, sections, headers) preserves context but produces variable-size chunks. Document-structure chunking (respect the document's hierarchy — chapters, sections, subsections) is best for structured documents like legal contracts or technical manuals. MW defaults to semantic chunking and switches to document-structure for highly formatted sources.
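A minimal sketch of the semantic chunking default, assuming paragraphs are separated by blank lines. Token counts are approximated with whitespace splitting here; a real pipeline would count with the embedding model's tokenizer:

```python
def semantic_chunks(text: str, target: int = 512, overlap: int = 50) -> list[str]:
    """Greedily pack whole paragraphs into chunks of roughly `target` tokens."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[list[str]] = []  # each chunk is a list of tokens
    current: list[str] = []
    for para in paragraphs:
        tokens = para.split()  # crude stand-in for a real tokenizer
        if current and len(current) + len(tokens) > target:
            chunks.append(current)
            # Carry the last `overlap` tokens forward so content that
            # straddles a boundary is retrievable from either chunk.
            current = current[-overlap:]
        current.extend(tokens)
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]
```

Note that a single paragraph longer than `target` passes through as one oversized chunk in this sketch; production code would fall back to sentence-level splitting in that case.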
Vector Search vs. Hybrid Search
Pure vector search works well for conversational queries ("how do I handle refunds?") but fails on exact-match queries ("what's clause 7.3.2?"). Hybrid search (vector + BM25 keyword) handles both. MW recommends hybrid search for any domain with specific terminology, codes, or identifiers — which is most enterprise domains. The extra moving parts (a keyword index and a rank-fusion step) are a modest cost for a significant relevance improvement.
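Reciprocal rank fusion itself is only a few lines. This sketch merges two ranked lists of document IDs; k=60 is the constant from the original RRF paper, and the IDs in the example are made up:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists (e.g. vector and BM25) into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists that returned
    it, so agreement between retrievers is rewarded without comparing
    their incompatible raw scores.
    """
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A document found only by keyword search (say, an exact product-code match)
# still makes the merged list:
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_d", "doc_a", "doc_b"]
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# -> ['doc_a', 'doc_b', 'doc_d', 'doc_c']
```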
Reranking: Cross-Encoder vs. None
Cross-encoder reranking adds 100-300ms latency but dramatically improves retrieval precision — we've measured 15-25% improvement in top-5 relevance across legal and healthcare domains. MW includes reranking by default for any RAG system where answer quality matters more than sub-second latency. For chatbots where speed is critical, we skip reranking and compensate with better chunking and prompt engineering.
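A sketch of the two-stage shape, with `ann_search` and `cross_encoder_score` as hypothetical callables and chunks assumed to expose a `.text` attribute. The point is where the latency budget goes: the cross-encoder runs against only the candidate pool, never the full corpus:

```python
def two_stage_retrieve(
    question: str,
    ann_search,           # stage 1: (question, top_k) -> candidate chunks
    cross_encoder_score,  # stage 2: (question, chunk_text) -> relevance score
    candidates: int = 50,
    final_k: int = 5,
):
    """Recall broadly with cheap ANN search, then rank precisely."""
    pool = ann_search(question, top_k=candidates)
    # A cross-encoder reads the question and chunk together, which is far
    # more precise than comparing independently computed embeddings but
    # much too slow to run over the whole corpus. Scoring only ~50
    # candidates is what keeps the added latency in the 100-300ms range.
    ranked = sorted(
        pool,
        key=lambda chunk: cross_encoder_score(question, chunk.text),
        reverse=True,
    )
    return ranked[:final_k]
```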
Single-Vector vs. Multi-Vector (ColBERT-style)
Single-vector embeddings are simpler and cheaper to store/search. Multi-vector representations (one vector per token, late interaction scoring) capture more nuance but require specialized infrastructure. MW uses single-vector for most deployments and reserves multi-vector for domains where retrieval quality is the bottleneck and the document corpus exceeds 100K chunks.
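For intuition on what multi-vector retrieval buys, here is the ColBERT-style late-interaction score (MaxSim) in NumPy, assuming one L2-normalized vector per token:

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction relevance: one embedding per token, not per text.

    For each query token, take its best match among the document's token
    vectors, then sum over query tokens. Rows are assumed L2-normalized,
    so the dot product is cosine similarity.
    """
    sim = query_vecs @ doc_vecs.T        # (q_tokens, d_tokens) similarity matrix
    return float(sim.max(axis=1).sum())  # best document token per query token
```

Single-vector search collapses each text to one row and one dot product; multi-vector keeps a tokens-by-dimension matrix per chunk, which is why it needs specialized storage and indexing.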
[Figure: System architecture overview of the RAG pipeline]

Technology Choices

  • Document Parsing: Unstructured, Apache Tika, LlamaParse, Docling, custom OCR (Tesseract, AWS Textract)
  • Embedding: OpenAI text-embedding-3-large, Cohere embed-v4, BGE-M3, E5-large-v2
  • Vector Database: Milvus, Pinecone, Qdrant, Weaviate, pgvector (for small-scale)
  • Keyword Search: Elasticsearch, OpenSearch, PostgreSQL full-text search
  • Reranking: Cohere Rerank, BGE Reranker, ColBERT v2, FlashRank
  • LLM: Claude (via AI Gateway), GPT-4, Gemini — provider-agnostic via AI SDK
  • Orchestration: LangChain, LlamaIndex, or custom pipeline (MW preference for production)

When to Use / When to Avoid

Use when:
  • Users need answers grounded in your organization's specific documents
  • Documents are updated frequently and the AI needs current information
  • Source citation and auditability are requirements (legal, compliance, healthcare)
  • Multiple user groups need access to different document subsets (permission-filtered RAG)

Avoid when:
  • The knowledge base is < 50 pages — just put it in the system prompt
  • You need the model to learn a new skill/behavior, not access new facts (fine-tune instead)
  • The questions are purely conversational and don't require factual grounding
  • You're building a creative writing tool where factual accuracy isn't the goal

Our Approach

MW builds RAG pipelines from the retrieval quality outward — we benchmark retrieval precision before touching the LLM prompt. A RAG system with mediocre retrieval and a great LLM produces confident-sounding wrong answers. Our standard pipeline includes a retrieval evaluation harness: a set of test queries with known-relevant documents, measured by MRR@5 and NDCG@10. We iterate on chunking, embedding model, and reranking until retrieval metrics hit target thresholds before optimizing generation. We've built RAG systems across legal document review, healthcare knowledge bases, and multi-language customer support — and the common lesson is that retrieval quality accounts for 80% of answer quality.
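
A sketch of the two retrieval metrics named above, using binary relevance judgments; `retrieved` is the ranked list of document IDs the pipeline returned for a test query, and `relevant` is the known-good set:

```python
import math

def mrr_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Reciprocal rank of the first relevant hit in the top k, else 0."""
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Binary-relevance NDCG: discounted gain vs. the ideal ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(r + 1) for r in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

The harness averages these over a fixed query set and re-runs after every chunking, embedding, or reranking change, so retrieval regressions surface before they show up as wrong answers.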

Related Industry Guides

  • AI for Legal — RAG applications in contract review and legal research

Related Technologies
AI Development, SaaS Development

Need Help Implementing This Architecture?

Our architects can help design and build systems using this pattern for your specific requirements.
