Production Optimization

RAG Foundations & Data Pipelines — Session 6

2026-04-26

A. Prototype to Production

Scaling your RAG system for real users

Agenda

  • A. Prototype to Production — The three optimization axes
  • B. HNSW Benchmarking — Systematic parameter tuning
  • C. Quantization — Trading precision for massive scale
  • D. Caching Strategies — Semantic and Response caching
  • E. Zero-Downtime Reindexing — Production update strategies
  • F. Cost Planning — Budgeting for vector infrastructure
  • G. Final Architecture — The complete Research Assistant

The Scale Question

From Prototype to Product

Your RAG system works on 1,000 documents. But production means:

  • 100K+ documents growing 10% monthly
  • Thousands of queries/day with p95 < 5 seconds
  • Zero downtime during knowledge base updates
  • Budget approval with a clear cost forecast

Three Optimization Axes

Speed

  • HNSW parameter tuning
  • ef_search optimization
  • Query caching
  • Batch operations

Target: < 50ms per search

Accuracy

  • M and ef_construction
  • Quantization impact
  • Recall@10 monitoring
  • Re-ranking quality

Target: > 0.95 Recall@10

Cost

  • Memory footprint
  • Storage per vector
  • API call pricing
  • Managed vs self-hosted

Target: < $0.10 per query (including LLM)

B. HNSW Benchmarking

Systematic parameter tuning

Benchmarking Framework

import time

import faiss
import numpy as np

def build_and_benchmark_hnsw(embeddings, queries,
    M_list=[16, 32, 64],
    ef_construction_list=[100, 200, 400],
    ef_search_list=[50, 100, 200],
    k=10):
    """Benchmark HNSW configurations.

    embeddings: float32 matrix of corpus vectors
    queries: float32 matrix of held-out query vectors (e.g., a sample of real queries)
    """
    d = embeddings.shape[1]  # embedding dimension

    # Ground truth: exact top-k neighbors from a brute-force baseline
    flat = faiss.IndexFlatL2(d)
    flat.add(embeddings)
    _, gt_ids = flat.search(queries, k)

    results = []
    for M in M_list:
        for ef_con in ef_construction_list:
            index = faiss.IndexHNSWFlat(d, M)
            index.hnsw.efConstruction = ef_con
            index.add(embeddings)

            for ef_search in ef_search_list:
                index.hnsw.efSearch = ef_search

                # Benchmark: average latency + Recall@10 vs the exact baseline
                start = time.perf_counter()
                _, ann_ids = index.search(queries, k)
                avg_ms = (time.perf_counter() - start) * 1000 / len(queries)
                recall_at_10 = float(np.mean(
                    [len(set(a) & set(g)) / k
                     for a, g in zip(ann_ids, gt_ids)]))

                results.append({
                    'M': M, 'ef_con': ef_con,
                    'ef_search': ef_search,
                    'search_ms': avg_ms,
                    'recall': recall_at_10})
    return results

Parameter Grid

Parameter Values Tested Effect
M 16, 32, 64 Graph connectivity (memory)
ef_construction 100, 200, 400 Index build quality (build time)
ef_search 50, 100, 200 Query-time search depth (latency)

Total configurations: 3 × 3 × 3 = 27 experiments

Search Time vs Recall

SOURCE: OpenSource Connections (2025). The Pareto frontier shows the optimal speed-accuracy trade-offs for HNSW.

SOURCE: OpenSource Connections (2025). Diagnostic visualization showing “missing documents” (red dots) at different search depths (k or ef_search), and how recall improves as search depth increases.

Starting Points

Rules of Thumb

Dataset Size M ef_construction ef_search
< 100K vectors 16 200 50-100
100K - 1M 32 200 100-200
> 1M 32-64 400 200+

You can increase ef_search at query time without rebuilding the index. Start low and increase until you hit your latency budget.
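
In FAISS this is a one-line runtime change, assuming `index` and `queries` are the HNSW index and query matrix from the benchmarking code above:

# efSearch only affects query-time graph traversal; no index rebuild needed
index.hnsw.efSearch = 200
distances, ids = index.search(queries, 10)  # k = 10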

C. Quantization

Trading precision for massive scale

Why Quantize?

100M vectors × 1536 dimensions × 4 bytes = ~614 GB (or ~572 GiB) just for the vectors.

Quantization reduces vector size by storing with lower precision:

  • Full float32: 4 bytes/dimension → 6 KB per vector
  • SQ8 (8-bit int): 1 byte/dimension → 1.5 KB per vector
  • Binary: 1 bit/dimension → 192 bytes per vector

At 100M vectors, going from float32 to SQ8 saves ~460 GB (~430 GiB) of memory.
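
A quick back-of-the-envelope check of those numbers (assuming 1536-dimensional embeddings):

NUM_VECTORS = 100_000_000
DIM = 1536

for name, bits_per_dim in [("float32", 32), ("SQ8", 8), ("binary", 1)]:
    bytes_per_vector = DIM * bits_per_dim // 8
    total_gb = NUM_VECTORS * bytes_per_vector / 1e9
    print(f"{name:>7}: {bytes_per_vector:>5} B/vector, {total_gb:>6.1f} GB total")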

Quantization Types

Type Compression Recall Impact Speed Impact
Scalar (SQ8) 4x Minimal (~2%) Faster (cache-friendly)
Product (PQ) 10-100x Moderate (~5-15%) Much faster
Binary 32x Variable (~5-15%) Fastest

Recall impact figures are approximate and vary significantly with dataset size, index parameters, and embedding model. Always validate against your evaluation suite.

FAISS Indexes

# 1. Flat Index (Exact - Baseline)
index_flat = faiss.IndexFlatL2(d)

# 2. HNSW (Approximate, Full Precision)
index_hnsw = faiss.IndexHNSWFlat(d, 32)  # M = 32 (FAISS constructors take positional args)

# 3. Scalar Quantization (SQ8)
index_sq8 = faiss.IndexScalarQuantizer(
    d, faiss.ScalarQuantizer.QT_8bit)
index_sq8.train(embeddings)
index_sq8.add(embeddings)

# 4. IVF + Product Quantization (IVFPQ)
quantizer = faiss.IndexFlatL2(d)
index_ivfpq = faiss.IndexIVFPQ(
    quantizer, d, 1024, 96, 8)  # nlist=1024, m=96 sub-quantizers, 8 bits each
index_ivfpq.train(embeddings)
index_ivfpq.add(embeddings)

Speed / Accuracy / Memory

When to Use What

Scale Recommended Index Why
< 100K HNSW Flat Best accuracy, memory is fine
100K - 1M HNSW + SQ8 4x memory savings, minimal recall loss
1M - 10M IVF + PQ Major memory savings, fast search
> 10M IVF + PQ + OPQ Maximum compression needed

Always A/B Test

Quantization must be validated against your ground truth metrics from the evaluation suite. Never assume recall is “good enough” — measure it.
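
One way to run that check with FAISS: compare a quantized index against an exact flat baseline on your own data. Here `embeddings` and `queries` are assumed to be float32 numpy arrays and `index_sq8` the SQ8 index built earlier:

import faiss
import numpy as np

def recall_vs_exact(candidate_index, embeddings, queries, k=10):
    """Recall@k of an approximate/quantized index vs. brute-force search."""
    flat = faiss.IndexFlatL2(embeddings.shape[1])
    flat.add(embeddings)
    _, gt = flat.search(queries, k)
    _, ann = candidate_index.search(queries, k)
    return float(np.mean([len(set(a) & set(g)) / k for a, g in zip(ann, gt)]))

print(f"SQ8 Recall@10: {recall_vs_exact(index_sq8, embeddings, queries):.3f}")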

D. Caching Strategies

Embedding-based similarity matching to skip redundant LLM calls

Response Caching

During development and production, you often see the exact same queries repeated.

LiteLLM ships with built-in exact-match caching — one line to enable it globally:

import litellm

# Enable in-memory caching for all subsequent completion() calls
litellm.enable_cache(type="local")

# Or use Redis in production
# litellm.enable_cache(type="redis", host="localhost", port=6379)

Limitation: Exact match fails if the user adds a single space or changes capitalization.

What is Semantic Caching?

The Key Insight

Two queries can be semantically identical but lexically different: "What is the capital of France?" and "Tell me France's capital city" should return the same cached answer.

How it works:

  1. Embed the incoming query → vector
  2. Search cache for similar vectors (cosine similarity)
  3. If similarity > threshold (e.g., 0.93) → return cached response
  4. Otherwise → call LLM, store result + embedding in cache
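
A minimal sketch of this loop, using sentence-transformers for the query embeddings. The in-memory list store and the `call_llm` callable are illustrative assumptions, not a specific library's API:

import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticCache:
    """Cache keyed by query meaning, not exact text."""

    def __init__(self, threshold=0.93):
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
        self.threshold = threshold
        self.vectors = []     # embeddings of cached queries
        self.responses = []   # corresponding LLM responses

    def get_or_generate(self, query, call_llm):
        vec = self.model.encode(query, normalize_embeddings=True)
        if self.vectors:
            sims = np.vstack(self.vectors) @ vec   # cosine (vectors are normalized)
            best = int(np.argmax(sims))
            if sims[best] > self.threshold:
                return self.responses[best]        # cache HIT
        response = call_llm(query)                 # cache MISS: call the LLM
        self.vectors.append(vec)
        self.responses.append(response)
        return response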

Semantic Cache Flow

graph TD
    Q["New Query"] --> E["Embed Query\n(sentence-transformers)"]
    E --> S["Cosine Similarity Search\nvs cache entries"]
    S -->|sim > 0.93| HIT["Cache HIT\nReturn stored response"]
    S -->|sim ≤ 0.93| MISS["Cache MISS"]
    MISS --> LLM["Call LLM API"]
    LLM --> STORE["Store response + embedding\nin cache"]
    STORE --> RESP["Return response"]
    HIT --> RESP

    style HIT fill:#00C9A7,stroke:#1C355E,color:#1C355E
    style MISS fill:#FF7A5C,stroke:#1C355E,color:#1C355E
    style LLM fill:#9B8EC0,stroke:#1C355E,color:#1C355E

Cost Impact of Caching

Scenario Daily Requests Cache Hit Rate Daily Cost Notes
No cache 10,000 0% $150 Baseline
Weak cache (sim > 0.99) 10,000 15% $127.50 Conservative, high precision
Strong cache (sim > 0.93) 10,000 30% $105
Aggressive (sim > 0.90) 10,000 45% $82.50 Risk of semantic mismatches

Threshold Trade-off

A lower threshold yields more cache hits, but raises the risk of returning a cached answer for a semantically different query.

E. Zero-Downtime Reindexing

Updating production without breaking it

Two Strategies

Incremental Updates

  • Upsert new documents
  • Index is not re-optimized
  • Works for small, regular additions
  • Quality can degrade over time
index.upsert(new_vectors)

Alias-Switching (Full Reindex)

  • Build a brand new index
  • Backfill all data
  • Swap pointer atomically
  • Delete old index

Zero-downtime, always optimal

Alias-Switching

sequenceDiagram
    participant App as Application
    participant Old as Index v1 (Live)
    participant New as Index v2 (Building)
    participant Alias as Alias Config

    App->>Alias: Read: "live = v1"
    App->>Old: Serve queries
    Note over New: Build new index<br/>Backfill all data
    New->>New: Verify data integrity
    App->>Alias: Write: "live = v2"
    App->>New: Serve queries (instant switch)
    Note over Old: Delete after cooldown

Code: PineconeReindexer

from datetime import datetime

class PineconeReindexer:
    def execute_zero_downtime_update(self, chunks):
        # 1. Generate unique name
        timestamp = datetime.utcnow().strftime("%Y%m%d%H%M")
        new_name = f"research-assistant-{timestamp}"

        # 2. Create & populate the new index
        self.create_index(new_name)
        self.backfill_data(new_name, chunks)

        # 3. Get current live index
        old_name = self.alias_manager.get_live_index()

        # 4. ATOMIC SWITCH
        self.alias_manager.set_live_index(new_name)

        # 5. Cleanup old index (in production, schedule this
        #    after the 24h cooldown from the safety checklist)
        if old_name and old_name != new_name:
            self.pc.delete_index(old_name)
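
The alias_manager can be as simple as a small config record the application reads to decide which physical index to query. A minimal file-backed sketch (the storage choice is an assumption; any shared config store works):

import json
from pathlib import Path

class AliasManager:
    """Tracks which physical index the 'live' alias currently points to."""

    def __init__(self, path="index_alias.json"):
        self.path = Path(path)

    def get_live_index(self):
        if not self.path.exists():
            return None
        return json.loads(self.path.read_text()).get("live")

    def set_live_index(self, name):
        # Write to a temp file, then atomically replace the alias record
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps({"live": name}))
        tmp.replace(self.path)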

Safety Checklist

Before Switching Live

  1. New index has same or more documents than old
  2. Spot-check 10 known queries produce correct results
  3. Eval metrics (Hit Rate, MRR) meet thresholds
  4. Rollback plan — keep old index for 24h before deletion
  5. Test on staging first — never go straight to production
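
A sketch of how the first three checks could gate the switch in code; the thresholds and the shapes of the stats/eval dictionaries are illustrative assumptions:

def safe_to_switch(old_stats, new_stats, eval_results,
                   min_hit_rate=0.85, min_mrr=0.70):
    """Return True only if the new index passes all pre-switch checks."""
    checks = {
        "doc_count": new_stats["total_vectors"] >= old_stats["total_vectors"],
        "hit_rate": eval_results["hit_rate"] >= min_hit_rate,
        "mrr": eval_results["mrr"] >= min_mrr,
    }
    for name, passed in checks.items():
        print(f"{'PASS' if passed else 'FAIL'}: {name}")
    return all(checks.values())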

F. Cost Planning

Budgeting for vector infrastructure

Cost Formula

def estimate_monthly_cost(config, num_vectors,
                          reads_per_month, writes_per_month):
    # Vector storage (float32: 4 bytes per dimension)
    bytes_per_vector = config.dimension * 4
    storage_gb = (num_vectors * bytes_per_vector) / 1e9

    # Metadata storage
    meta_gb = (num_vectors * config.meta_bytes) / 1e9

    # Total cost: base fee + storage + read units + write units
    storage_cost = (storage_gb + meta_gb) * config.price_per_gb
    read_cost = (reads_per_month / 1e6) * config.price_per_million_reads
    write_cost = (writes_per_month / 1e6) * config.price_per_million_writes

    return config.base_fee + storage_cost + read_cost + write_cost
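
For example, a quick run with a hypothetical pricing config (the field names and numbers below are illustrative, not any vendor's actual rates):

from dataclasses import dataclass

@dataclass
class VectorDBConfig:
    dimension: int = 1536
    meta_bytes: int = 500              # avg metadata per vector
    base_fee: float = 70.0             # monthly platform fee (illustrative)
    price_per_gb: float = 0.33         # storage, per GB-month (illustrative)
    price_per_million_reads: float = 8.25
    price_per_million_writes: float = 2.00

# 500K vectors, ~300K queries and 50K upserts per month
cost = estimate_monthly_cost(VectorDBConfig(), num_vectors=500_000,
                             reads_per_month=300_000,
                             writes_per_month=50_000)
print(f"Estimated monthly cost: ${cost:.2f}")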

Cost Forecast

Reduction Strategies

Strategy Savings Trade-off
Quantization (SQ8) 75% storage Small recall loss
Larger chunks Fewer vectors Less precision
Metadata pruning Up to 50% metadata storage Less filtering capability
Tiered storage 50%+ Cold data has higher latency
Caching Varies by query distribution Stale results for rare queries

G. Final Architecture

The Research Assistant Picture

System Architecture

flowchart LR
    User["User Query"] --> QP["Query Processor"]
    QP -->|"Expanded Queries"| AR["Advanced Retriever"]
    AR -->|"Hybrid Search"| VS[("Vector Store")]
    AR -->|"BM25 Search"| KS[("Keyword Index")]
    VS & KS --> Cands["Candidate Pool"]
    Cands --> RR["Cross-Encoder<br/>Re-ranker"]
    RR -->|"Top K"| CM["Context Manager"]
    CM -->|"Pruned Context"| Gen["LLM Generation"]
    Gen -->|"Answer + Citations"| User

    subgraph Offline Pipeline
        Ingest["Ingestion"] --> VS
        Eval["DeepEval Suite"] -.->|"Grades"| Gen
    end

    style User fill:#9B8EC0,stroke:#1C355E,color:#fff
    style QP fill:#9B8EC0,stroke:#1C355E,color:#fff
    style AR fill:#00C9A7,stroke:#1C355E,color:#fff
    style VS fill:#1C355E,stroke:#1C355E,color:#fff
    style KS fill:#1C355E,stroke:#1C355E,color:#fff
    style RR fill:#FF7A5C,stroke:#1C355E,color:#fff
    style CM fill:#FF7A5C,stroke:#1C355E,color:#fff
    style Gen fill:#FF7A5C,stroke:#1C355E,color:#fff
    style Ingest fill:#00C9A7,stroke:#1C355E,color:#fff
    style Eval fill:#9B8EC0,stroke:#1C355E,color:#fff

ResearchAssistantService

class ResearchAssistantService:
    """Production entry point — integrates all components."""

    def __init__(self, vector_store, llm_client):
        self.query_processor = QueryProcessor()
        self.retriever = AdvancedRetriever(vector_store)
        self.context_manager = ContextManager(max_tokens=3000)
        self.llm = llm_client

Integrates:

  • Query Understanding (Session 4)
  • Hybrid Retrieval & Re-ranking (Session 4)
  • Context Optimization (Session 5)
  • Citation Generation (Session 5)

The answer() Method

async def answer(self, raw_query):
    # 1. Query Processing
    queries = self.query_processor.expand(raw_query)

    # 2. Hybrid Retrieval (per variation)
    all_candidates = []
    for q in queries:
        results = self.retriever.retrieve(q, candidates=50)
        all_candidates.extend(results)

    # 3. Deduplicate + Global Re-rank
    unique = {doc['id']: doc for doc in all_candidates}
    final = self.retriever.rerank(
        raw_query, list(unique.values()), top_k=5)

    # 4. Context Management + Generation
    context = self.context_manager.format(final)
    response = await self.llm.generate(
        system=CITATION_PROMPT,
        user=f"Context:\n{context}\n\nQ: {raw_query}")

    return {"answer": response, "citations": final}

Agentic RAG — Beyond Single-Shot

RAG architectures can be viewed as increasing levels of sophistication:

Level Pattern What It Adds
Basic Single-Shot RAG One query → retrieve → generate
Enhanced Multi-Query RAG Expand queries, fuse results
Self-Correcting Corrective RAG Evaluate → recover from bad retrieval
Autonomous Agentic RAG Plan → act → observe → loop

Agentic RAG adds: planning, tool use, multi-step loops, and self-correction across turns.
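
A schematic of that loop; all names here are hypothetical, and the point is the control flow rather than any specific framework:

def agentic_answer(goal, agent, retriever, max_steps=5):
    """Plan -> retrieve -> observe loop; stops when the agent judges the goal met."""
    evidence = []
    for _ in range(max_steps):
        step = agent.plan(goal, evidence)                   # decide the next query
        evidence.extend(retriever.retrieve(step, candidates=20))
        if agent.goal_met(goal, evidence):                  # observe / self-check
            break
    return agent.synthesize(goal, evidence)                 # final answer with citations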

Agentic RAG Architecture

flowchart LR
    U["User Goal"] --> A["Agent<br/>Planner"]
    A --> P["Plan<br/>Steps"]
    P --> R["Retrieve<br/>Evidence"]
    R --> O["Observe<br/>Results"]
    O --> D{"Goal<br/>Met?"}
    D -- "No" --> A
    D -- "Yes" --> S["Synthesize<br/>Answer"]

    style U fill:#9B8EC0,stroke:#1C355E,color:#fff
    style A fill:#9B8EC0,stroke:#1C355E,color:#fff
    style P fill:#00C9A7,stroke:#1C355E,color:#fff
    style R fill:#00C9A7,stroke:#1C355E,color:#fff
    style O fill:#FF7A5C,stroke:#1C355E,color:#fff
    style D fill:#FF7A5C,stroke:#1C355E,color:#fff
    style S fill:#1C355E,stroke:#1C355E,color:#fff

Production Checklist

Component Status Metric
Ingestion pipeline Handles multi-format (PDF, web) Coverage
Chunking Factory pattern, configurable Avg size 500-1000
Vector store HNSW tuned, persistent Recall@10 > 0.95
Hybrid retrieval BM25 + semantic + RRF Precision@5 > 0.80
Re-ranking Cross-encoder quality gate Relevance score > 0.7
Citations [ID] format, context-pruned Faithfulness > 0.8
Evaluation Golden dataset, CI/CD gates Automated regression
Reindexing Zero-downtime alias switching No user impact

Key Takeaways

  1. Benchmark HNSW parameters systematically — M, ef_construction, ef_search
  2. Quantization (SQ8, PQ) enables scaling to millions of vectors with 4x (SQ8) to 10-100x (PQ) compression
  3. Zero-downtime reindexing via alias-switching keeps production running during updates
  4. Cost planning with growth forecasts justifies infrastructure decisions
  5. The complete RAG pipeline integrates ingestion, retrieval, generation, and evaluation
  6. Production is a mindset — monitoring, evaluation, and continuous improvement

Up Next

Lab 6: Production Readiness — benchmark your index, implement quantization, run the full evaluation suite, and finalize the Research Assistant.