RAG Architecture & Foundations

RAG Foundations & Data Pipelines — Session 1

2026-04-26

A. The RAG Paradigm

Grounding language models in real knowledge

Agenda

  • A. The RAG Paradigm — What RAG is and why it matters
  • B. How Retrieval Works — Embeddings and vector similarity
  • C. RAG Architecture — The three pillars in action
  • D. The Case for RAG — Strengths, limitations, and trade-offs
  • E. The Decision Framework — RAG vs fine-tuning vs prompting
  • F. Embedding Models — Choosing your engine

The Problem RAG Solves

Without RAG

  • LLM knowledge is frozen at training time
  • Cannot access private or recent data
  • No source attribution — “trust me”
  • Hallucinations are undetectable
User → LLM → (best guess)

With RAG

  • Dynamic, up-to-date knowledge
  • Grounded in your documents
  • Every claim has a source
  • Hallucinations are measurable
User → Retrieve → Augment → Generate

Why Keyword Search Fails

Scenario: You’re searching YouTube video transcripts for:

“how to fix a bike”

Video 1

“In this tutorial, I’ll show you how to fix a bike chain that keeps falling off.”

  • Result: ✅ Found
  • Why: Exact keyword match

Video 2

“Bicycle repair tutorial: learn to repair a broken chain in 5 minutes.”

  • Result: ❌ Missed
  • Why: No shared keywords

The Problem

Same idea, different words — keyword search cannot understand that “fix a bike” and “bicycle repair” mean the same thing. This is where embeddings come in.
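The gap can be shown in a few lines. This is a toy illustration only: the stopword list and tokenizer are simplistic stand-ins for a real keyword search engine, and the transcripts are the two examples above.

```python
# Toy illustration of the keyword-matching gap described above.
def keyword_match(query: str, document: str) -> bool:
    """Return True if any non-stopword query term appears in the document."""
    stopwords = {"a", "to", "how", "in", "the"}
    query_words = {w.lower().strip(".,:?!'\"") for w in query.split()} - stopwords
    doc_words = {w.lower().strip(".,:?!'\"") for w in document.split()}
    return bool(query_words & doc_words)

query = "how to fix a bike"
video1 = "In this tutorial, I'll show you how to fix a bike chain that keeps falling off."
video2 = "Bicycle repair tutorial: learn to repair a broken chain in 5 minutes."

print(keyword_match(query, video1))  # True  -- exact overlap on "fix" and "bike"
print(keyword_match(query, video2))  # False -- "bicycle repair" shares no tokens
```

No amount of stopword tuning fixes Video 2 — the query and the transcript share zero content words, even though they describe the same task.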

B. How Retrieval Works

The bridge between language and math

What Are Embeddings?

An embedding is a numerical representation of text — a high-dimensional vector that captures semantic meaning.

Key Properties

  • Similar text → nearby vectors
  • “Dog” ≈ “Puppy” in vector space
  • king - man + woman ≈ queen
  • Meaning encoded as geometry

Why It Matters for RAG

  • Query → vector → find similar chunks
  • Understands synonyms automatically
  • Goes beyond keyword matching
  • Foundation of semantic search

Semantic Clusters

Related concepts form tight clusters. Your retriever finds chunks whose vectors are “close” to the query vector.

Vector Arithmetic
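The famous king/queen analogy can be reproduced with hand-picked toy vectors. These 3-d values are invented purely for illustration — real embedding models produce hundreds of dimensions — but the arithmetic is identical.

```python
# Hand-picked 3-d toy vectors (illustrative values, not real embeddings).
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.5, 0.9, 0.1],
    "woman": [0.5, 0.1, 0.1],
    "queen": [0.9, 0.0, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

# king - man + woman, element-wise
result = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(a) * norm(b))

# Find the vocabulary word whose vector is closest to the computed point
closest = max(vectors, key=lambda word: cosine(result, vectors[word]))
print(closest)  # queen
```

The "royalty" and "gender" directions in the toy space compose linearly, so the nearest word to `king - man + woman` is `queen`.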

Beyond Text: Multimodal Embeddings

Embeddings can represent any data where “similarity” matters:

  • Images — Visual similarity and search
  • Video — Action and scene recognition
  • Audio — Speech patterns and music
  • Multimodal (CLIP) — Connecting text to images
  • And others …

High-Dimensional Reality

| Aspect     | Visualization       | Reality                        |
|------------|---------------------|--------------------------------|
| Dimensions | 2-D scatter plot    | 768–3072 dimensions            |
| Distances  | Euclidean on screen | Cosine in practice             |
| Clusters   | Visually obvious    | Exist in high-dim subspaces    |
| Precision  | Approximate         | Exact within model’s capacity  |

Key Insight

We visualize in 2-D for intuition, but the real power of embeddings lives in hundreds or thousands of dimensions where subtle semantic distinctions are captured.

Cosine Similarity Explained

Measures the angle between two vectors, not the distance.

\[\text{cosine\_sim}(A, B) = \frac{A \cdot B}{\|A\| \times \|B\|}\]

  • 1.0 = identical direction (same meaning)
  • 0.0 = orthogonal (unrelated)
  • -1.0 = opposite direction

Focuses on direction rather than magnitude — a short summary and a long document on the same topic score high.
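The formula above translates directly into code. A minimal sketch, using only the standard library:

```python
import math

def cosine_sim(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b: dot product over product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_sim([1.0, 0.0], [1.0, 0.0]))   # 1.0  -- identical direction
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))   # 0.0  -- orthogonal
print(cosine_sim([1.0, 0.0], [-1.0, 0.0]))  # -1.0 -- opposite direction
print(cosine_sim([1.0, 0.0], [10.0, 0.0]))  # 1.0  -- magnitude is ignored
```

The last case is the length-invariance property: scaling a vector by 10× leaves its cosine similarity unchanged.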

Cosine vs Euclidean

Why Cosine for Text?

Cosine Similarity

  • Ignores vector length
  • “Dog” (short) ≈ “Dogs are loyal companions” (long)
  • Robust to document length variation
  • Standard for embedding APIs

Euclidean Distance

  • Penalizes different magnitudes
  • Long doc vs short doc = “far apart”
  • Better for spatial data (geolocation)
  • Requires normalized vectors for text

Bottom Line

For text embeddings, prefer cosine similarity by default. Some APIs (e.g., OpenAI text-embedding-3) return unit-length vectors, in which case dot product produces the same ranking as cosine — check your model’s documentation for which metric it was designed for.
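Why can some systems use dot product interchangeably with cosine? Because once vectors are normalized to unit length, the two are numerically identical. A quick check with arbitrary values (not tied to any particular model):

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_sim(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [1.0, 2.0]
ua, ub = normalize(a), normalize(b)

# For unit-length vectors, dot product and cosine similarity coincide
print(round(dot(ua, ub), 6) == round(cosine_sim(a, b), 6))  # True
```

Dot product skips the two norm computations per comparison, which is why vector stores often prefer it when embeddings arrive pre-normalized.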

Metrics Cheat Sheet

| Metric            | Range    | Best For                   | RAG Default? |
|-------------------|----------|----------------------------|--------------|
| Cosine Similarity | [-1, 1]  | Text, normalized vectors   | Yes          |
| Euclidean (L2)    | [0, ∞)   | Spatial data, images       | No           |
| Dot Product       | (-∞, ∞)  | Pre-normalized embeddings  | Sometimes    |
| Manhattan (L1)    | [0, ∞)   | Sparse, high-dim vectors   | No           |

C. RAG Architecture

Bringing it all together

Core Architecture

flowchart LR
    subgraph online["Query Pipeline (Online — per request)"]
        A[User Query] --> B[Query Embedding]
        B --> C[Vector Store<br/>Retrieval]
        C --> D[Top-K Context<br/>Chunks]
        D --> E[Orchestrator<br/>Formats Prompt]
        E --> F[LLM Generator]
        F --> G[Answer +<br/>Source Citations]
    end

    subgraph offline["Indexing Pipeline (Offline — run once)"]
        H[Raw Documents] --> I[Chunking &<br/>Cleaning]
        I --> J[Generate<br/>Embeddings]
        J --> K[(Vector Store)]
    end

    C -.-> K

    style A fill:#9B8EC0,stroke:#1C355E,color:#fff
    style B fill:#00C9A7,stroke:#1C355E,color:#fff
    style C fill:#00C9A7,stroke:#1C355E,color:#fff
    style D fill:#00C9A7,stroke:#1C355E,color:#fff
    style E fill:#FF7A5C,stroke:#1C355E,color:#fff
    style F fill:#FF7A5C,stroke:#1C355E,color:#fff
    style G fill:#1C355E,stroke:#1C355E,color:#fff
    style H fill:#9B8EC0,stroke:#1C355E,color:#fff
    style I fill:#9B8EC0,stroke:#1C355E,color:#fff
    style J fill:#9B8EC0,stroke:#1C355E,color:#fff
    style K fill:#1C355E,stroke:#1C355E,color:#fff

“The quality of your answers can never exceed the quality of your retrieval.”
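The two pipelines in the diagram can be sketched end-to-end in a few dozen lines. This is a minimal toy: `embed()` is a bag-of-words stand-in for a real embedding model, and the final step prints the formatted prompt rather than calling an LLM — all names here are illustrative, not a real API.

```python
import math

def embed(text: str) -> dict[str, float]:
    """Toy bag-of-words 'embedding' -- a real system calls an embedding model."""
    words = text.lower().split()
    return {w: float(words.count(w)) for w in set(words)}

def cosine_sim(a: dict, b: dict) -> float:
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Indexing pipeline (offline): embed each chunk once and store it
chunks = [
    "RAG grounds language models in retrieved documents.",
    "Fine-tuning bakes knowledge into model weights.",
    "Cosine similarity measures the angle between vectors.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Query pipeline (online): embed the query, retrieve top-k, format the prompt
def build_prompt(query: str, k: int = 2) -> str:
    q_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine_sim(q_vec, item[1]), reverse=True)
    context = "\n".join(chunk for chunk, _ in ranked[:k])
    return f"Context:\n{context}\n\nQuestion: {query}"  # sent to the LLM generator

prompt = build_prompt("how does cosine similarity work?")
print(prompt)
```

Even in this toy, the quote above holds: if retrieval ranks the wrong chunk first, the generator never sees the right context.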

D. The Case for RAG

Why RAG dominates enterprise AI

| Aspect        | Prompting          | RAG             | Fine-Tuning               |
|---------------|--------------------|-----------------|---------------------------|
| Factuality    | Low (hallucinates) | High (grounded) | Moderate                  |
| Adaptability  | High (instant)     | High (instant)  | Low (requires retraining) |
| Setup Cost    | Lowest             | Moderate        | High                      |
| Style Control | Low                | Low             | Highest                   |
| Citation      | No                 | Yes             | No                        |

SOURCE: Based on industry consensus (Hotz, 2023; Microsoft Research, 2024). RAG is the optimal choice for knowledge-intensive tasks requiring high factuality and source attribution.

RAG in the Wild

Enterprise Search

  • Internal wikis, policy docs, runbooks
  • Employees ask in natural language
  • Every answer links to the source page

Legal & Compliance

  • Contracts, regulations, case law
  • Precise citations are non-negotiable
  • Auditable reasoning trail

Customer Support

  • Product manuals, FAQs, ticket history
  • Handles edge cases without retraining
  • Escalates when confidence is low

Research Assistants

  • Live corpus of papers and reports
  • Synthesises across hundreds of docs
  • Reproducible with cited passages

Limitations

1. Latency Overhead — Multiple network hops: embed → search → generate. Adds hundreds of ms.

2. Retrieval Dependency — “Garbage in, garbage out.” Wrong chunks → wrong answers.

3. Over-Retrieval — Too many chunks = context dilution, higher cost, confused LLM.

4. The Chunking Problem — Bad splits destroy meaning. Finding optimal strategy is hard.
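One common baseline for the chunking problem is a fixed-size window with overlap, so an idea split at a boundary still appears whole in the neighboring chunk. A minimal sketch — the window and overlap sizes here are arbitrary; tuning them for your corpus is the hard part:

```python
def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "Bad splits destroy meaning. Overlapping windows keep neighboring context together."
for c in chunk_text(doc):
    print(repr(c))
```

Note that even this simple strategy cuts mid-word; production pipelines typically split on sentence or token boundaries instead, which is exactly why finding the optimal strategy is hard.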

Trade-off Curve

Sweet Spot

The goal is to stay in the shaded zone — enough complexity for quality, not so much that latency suffers.

E. The Decision Framework

When to choose RAG, fine-tuning, or prompting

Decision Tree

flowchart LR
    Start[New AI Feature] --> Q1{Need dynamic/private<br/>knowledge or citations?}
    Q1 -- Yes --> RAG[Choose RAG]
    Q1 -- No --> Q2{Highly specialized task?<br/>Style, tone, code gen?}
    Q2 -- Yes --> Q3{Stable task +<br/>training data available?}
    Q3 -- Yes --> FT[Choose Fine-Tuning]
    Q3 -- No --> Prompt[Advanced Prompting]
    Q2 -- No --> Q4{Simple general task?<br/>No private data?}
    Q4 -- Yes --> Prompt
    Q4 -- No --> RAG

    style RAG fill:#00C9A7,stroke:#1C355E,color:#fff
    style FT fill:#FF7A5C,stroke:#1C355E,color:#fff
    style Prompt fill:#9B8EC0,stroke:#1C355E,color:#fff
    style Start fill:#1C355E,stroke:#1C355E,color:#fff
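The flowchart above can be expressed as a plain function. The flag names are invented for this sketch; the branching mirrors the tree exactly.

```python
def choose_approach(
    needs_dynamic_or_private_knowledge: bool,
    needs_citations: bool,
    specialized_task: bool,
    stable_with_training_data: bool,
    simple_general_task: bool,
) -> str:
    """Walk the decision tree: RAG vs fine-tuning vs advanced prompting."""
    if needs_dynamic_or_private_knowledge or needs_citations:
        return "RAG"
    if specialized_task:
        return "Fine-Tuning" if stable_with_training_data else "Advanced Prompting"
    return "Advanced Prompting" if simple_general_task else "RAG"

# Research Assistant: private PDFs and citations required
print(choose_approach(True, True, False, False, False))   # RAG

# Brand-voice generation: stable, specialized, labeled data available
print(choose_approach(False, False, True, True, False))   # Fine-Tuning
```

Encoding the tree as code makes the default visible: when no branch clearly applies, the tree falls through to RAG.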

When to Choose Each

RAG

  • Private/dynamic documents
  • Source citations required
  • Broad knowledge domain
  • Frequent data updates

Example: Research assistant, helpdesk

Fine-Tuning

  • Change model behavior/style
  • Stable, narrow task
  • Quality labeled dataset
  • Latency-critical path

Example: Brand voice, code gen

Prompting

  • Simple, general tasks
  • No private data needed
  • Prototype in minutes
  • Model’s knowledge suffices

Example: Summarization, classification

Quick-Reference Table

| Factor      | RAG                   | Fine-Tuning           | Prompting             |
|-------------|-----------------------|-----------------------|-----------------------|
| Knowledge   | External, dynamic     | Baked into weights    | Model’s training data |
| Update Cost | Add docs (cheap)      | Retrain (expensive)   | Edit prompt (free)    |
| Citation    | Built-in              | Not available         | Not available         |
| Latency     | Higher (retrieval)    | Lowest                | Low                   |
| Best For    | Knowledge-intensive QA | Style/behavior change | Simple tasks          |

Applied: Research Assistant

| Requirement                | Analysis                       | Verdict         |
|----------------------------|--------------------------------|-----------------|
| Dynamic/private knowledge? | Yes — growing corpus of PDFs   | RAG             |
| Source attribution?        | Yes — academic citations       | RAG             |
| Task stable & narrow?      | No — open-ended questions      | Not fine-tuning |
| Training data available?   | No QA pairs                    | Not fine-tuning |

Verdict

RAG is the clear and necessary choice for our Research Assistant.

F. Embedding Models

Choosing the right engine

The Model Landscape

graph LR
    A["Proprietary APIs<br/>(OpenAI, Cohere)"] --> D[Your RAG System]
    B["Open Source Local<br/>(MiniLM, BGE, E5)"] --> D
    C["Open Source Hosted<br/>(HF Endpoints, Jina)"] --> D

    style A fill:#FF7A5C,stroke:#1C355E,color:#fff
    style B fill:#00C9A7,stroke:#1C355E,color:#fff
    style C fill:#9B8EC0,stroke:#1C355E,color:#fff
    style D fill:#1C355E,stroke:#1C355E,color:#fff

Comparison Table

| Type        | Examples                   | Accuracy | Cost        | Privacy      | Latency        |
|-------------|----------------------------|----------|-------------|--------------|----------------|
| Proprietary | text-embedding-3-small     | High     | Per-token   | Data leaves  | API-bound      |
| Open Local  | all-MiniLM-L6-v2, bge-base | Good     | Free        | Full control | Hardware-bound |
| Open Hosted | HF Endpoints, Jina         | Good     | Per-request | Moderate     | API-bound      |

Our Strategy

Start with OpenAI text-embedding-3-small for development (cheap, fast, reliable). Evaluate open-source models on your data for production cost savings.

Key Takeaways

  1. RAG = Retrieve + Augment + Generate — ground LLMs in external knowledge
  2. Three pillars: Retriever (search), Generator (LLM), Orchestrator (glue)
  3. Use RAG when you need dynamic knowledge, citations, or broad coverage
  4. Embeddings convert text into vectors that capture semantic meaning
  5. Cosine similarity is the standard metric for text — direction, not magnitude
  6. Model choice depends on accuracy needs, budget, and privacy requirements

Up Next

Lab 1: Build the Document Ingestion Pipeline — extracting and cleaning text from PDFs and web pages for our Research Assistant.