Beyond Basic Search

RAG Foundations & Data Pipelines — Session 4

2026-04-26

A. Why Simple Search Fails

When cosine similarity isn’t enough

Agenda

  • A. Why Simple Search Fails — Limitations of basic vector search
  • B. Hybrid Search — Combining keyword and semantic retrieval
  • C. Re-Ranking — The cross-encoder quality gate
  • D. Query Understanding — Preprocessing and intent classification
  • E. Query Expansion — Generating better search queries
  • F. Wrap-up — The complete advanced retrieval pipeline

When Semantic Search Misses

Exact Keywords

“Error 503” or “sku-123”

Vectors capture meaning, sometimes losing exact tokens.

Query Mismatch

User: “How do I fix X?” Doc: “Troubleshooting X”

Semantic distance is larger than expected.

Lost in the Middle

LLMs struggle when the answer is buried in the middle of a long context window.

The Advanced Pipeline

flowchart LR
    A["User Query"] --> B["Query<br/>Processing"]
    B --> C["Hybrid<br/>Retrieval"]
    C --> D{"Top 50<br/>Candidates"}
    D --> E["Re-Ranking<br/>(Cross-Encoder)"]
    E --> F{"Top 5<br/>Relevant"}
    F --> G["LLM<br/>Generation"]

    style A fill:#9B8EC0,stroke:#1C355E,color:#fff
    style B fill:#9B8EC0,stroke:#1C355E,color:#fff
    style C fill:#00C9A7,stroke:#1C355E,color:#fff
    style D fill:#00C9A7,stroke:#1C355E,color:#fff
    style E fill:#FF7A5C,stroke:#1C355E,color:#fff
    style F fill:#FF7A5C,stroke:#1C355E,color:#fff
    style G fill:#1C355E,stroke:#1C355E,color:#fff

The Upgrade Path

Stage What Changes Impact
Basic Vector search only Baseline (varies by corpus)
+ Hybrid Add BM25 keyword search Catches exact matches missed by vectors
+ Re-ranking Cross-encoder scores pairs Meaningfully improves precision
+ Query expansion Multiple search variations Improves recall for ambiguous queries
+ Caching Store repeated patterns Reduces latency for repeated queries

Exact gains depend on corpus, query distribution, and embedding model. Measure on your own golden dataset.

Sparse vs Dense

Sparse Vectors (BM25/Keyword)

  • Exact token matching
  • Great for: error codes, IDs, proper nouns
  • Fast and interpretable
  • Misses synonyms entirely
"Error 503" → matches "Error 503"
           → misses "server unavailable"

Dense Vectors (Embeddings)

  • Semantic understanding
  • Great for: intent, synonyms, context
  • Captures meaning relationships
  • May miss exact tokens
"Error 503" → matches "server unavailable"
           → might miss "Error 503" literally

BM25 in 30 Seconds

BM25 (Best Match 25) is a probabilistic ranking function for keyword search.

  • Scores documents by term frequency and inverse document frequency
  • Well-understood, battle-tested since the 1990s
  • Available in Elasticsearch, Whoosh, or rank-bm25 Python package
  • Complements vector search perfectly

RRF Formula + Example

Reciprocal Rank Fusion (RRF) merges ranked lists:

\[\text{RRF}(d) = \sum_{r \in \text{rankers}} \frac{1}{\text{rank}_r(d) + k}\]

Document Semantic Rank Keyword Rank RRF Score (k=60)
Doc A 1 3 1/61 + 1/63 = 0.0326
Doc B 2 1/62 = 0.0161
Doc C 3 1 1/63 + 1/61 = 0.0322

Documents that score well in both systems rise to the top.

RRF Visual

Code: Hybrid Retrieval

def reciprocal_rank_fusion(results_lists, k=60):
    """Combine multiple ranked lists using RRF."""
    fused_scores = {}
    doc_map = {}

    for result_list in results_lists:
        for rank, doc in enumerate(result_list):
            doc_id = doc['id']
            if doc_id not in doc_map:
                doc_map[doc_id] = doc
            prev = fused_scores.get(doc_id, 0.0)
            fused_scores[doc_id] = prev + 1/(rank + k)

    # Sort by fused score (descending)
    ranked = sorted(fused_scores.items(),
                    key=lambda x: x[1], reverse=True)
    return [doc_map[doc_id] for doc_id, _ in ranked]

C. Re-Ranking

The quality gate

Bi-Encoder vs Cross-Encoder

Aspect Bi-Encoder (Embedding) Cross-Encoder (Re-ranker)
Mechanism Encodes Query & Doc separately Encodes Query & Doc together
Speed Fast (Pre-compute embeddings) Slow (Must score each pair)
Accuracy Good (Semantic overlap) Excellent (Deep interaction)
Scale / Use Millions of docs (First stage) Top 50–100 only (Re-ranking)

The Two-Stage Pattern

sequenceDiagram
    participant Query
    participant BiEncoder as Bi-Encoder<br/>(Fast, Broad)
    participant CrossEncoder as Cross-Encoder<br/>(Slow, Accurate)
    participant LLM

    Query->>BiEncoder: "What is FlashAttention?"
    BiEncoder-->>BiEncoder: Top 50 candidates (fast)
    BiEncoder->>CrossEncoder: 50 (query, doc) pairs
    CrossEncoder-->>CrossEncoder: Score each pair
    CrossEncoder->>LLM: Top 5 re-ranked results

Code: Reranker

class AdvancedRetriever:
    def __init__(self, vector_store,
                 reranker_model="BAAI/bge-reranker-base"):
        self.vector_store = vector_store
        self.reranker = CrossEncoder(reranker_model)

    def rerank(self, query, candidates, top_k=5):
        """Re-rank candidates using Cross-Encoder."""
        pairs = [[query, doc['content']]
                 for doc in candidates]

        scores = self.reranker.predict(pairs)
        for i, doc in enumerate(candidates):
            doc['relevance_score'] = float(scores[i])

        ranked = sorted(candidates,
                       key=lambda x: x['relevance_score'],
                       reverse=True)
        return ranked[:top_k]

Complete AdvancedRetriever

def retrieve(self, query, top_k=5, candidates=50):
    """Full pipeline: Hybrid → RRF → Re-rank → Top K"""

    # 1. Semantic search (dense)
    semantic = self._retrieve_semantic(query, candidates)

    # 2. Keyword search (sparse)
    keyword = self._retrieve_keyword(query, candidates)

    # 3. Fuse with RRF
    if keyword:
        hybrid = reciprocal_rank_fusion(
            [semantic, keyword])
    else:
        hybrid = semantic

    # 4. Re-rank top candidates
    return self.rerank(query, hybrid[:candidates], top_k)

Late Interaction Models (e.g., ColBERT)

A powerful middle ground between Bi-Encoders and Cross-Encoders.

How it Works

  • Encodes queries and documents into multiple embeddings (token-level) rather than a single vector.
  • Computes “MaxSim” (maximum similarity) between query tokens and document tokens during retrieval.

The Tradeoff

  • Pre-computable: Document embeddings are calculated offline (like Bi-Encoders).
  • Nuanced: Captures fine-grained word relationships (approaching Cross-Encoders).
  • Cost: Requires significantly more storage (one embedding per token).

D. Query Understanding

Making queries retrieval-friendly

Intent Classification

Instead of fragile keyword matching, use an LLM Classifier with Structured Output.

from enum import Enum
from pydantic import BaseModel

class Intent(str, Enum):
    FACTUAL = "factual"
    COMPARISON = "comparison"
    TROUBLESHOOTING = "troubleshooting"

class QueryAnalysis(BaseModel):
    intent: Intent
    is_complex: bool
    suggested_strategy: str

# LLM call with response_format
analysis = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": user_query}],
    response_format=QueryAnalysis
)

Semantic Routing

For high-traffic systems, use Embeddings to route queries without an LLM call.

How it Works - Create pre-defined “Route Embeddings” for each category. - Compare user query embedding to Route embeddings. - If similarity > threshold, fire that route.

Why use it? - Latency: Sub-10ms routing vs 500ms+ for LLM. - Cost: Fraction of a cent per 1k queries. - Tools: semantic-router (by Aurelio AI) or simple vector similarity.

Query Translation (Text-to-Filter)

Modern RAG doesn’t just search text—it generates Structured Filters.

User: “Show me the 2024 financial reports for the EMEA region.”

LLM Translation (Structured Output):

class SearchFilters(BaseModel):
    year: int
    region: str
    document_type: str

# The LLM extracts entities into this schema:
# { "year": 2024, "region": "EMEA", "document_type": "financial_report" }

This allows highly precise retrieval by combining Vector Search with Metadata Filtering.

E. Query Expansion

Capturing what the user really means

Multi-Query Generation

Sometimes the user asks the wrong question. Use an LLM to generate variations:

Original: “What is FlashAttention?”

Expanded:

  1. “What is FlashAttention?”
  2. “FlashAttention algorithm methodology”
  3. “FlashAttention vs standard attention performance”
  4. “IO-aware attention mechanism”

Each variation may retrieve different relevant chunks. Merge and deduplicate the results.

Code: Multi-Query

def generate_multi_queries(original_query, n=3):
    """Generate query variations using an LLM."""
    prompt = f"""Generate {n} different search queries
    that would help answer: "{original_query}"
    from a database of technical research papers.
    Provide only the queries, one per line."""

    response = llm.complete(prompt)
    queries = [original_query]  # Always include original
    queries.extend(response.strip().split('\n'))
    return queries[:n + 1]

Query Decomposition

Complex questions contain multiple sub-questions — search for each separately.

flowchart LR
    A["Complex Query"] --> B["Sub-Query 1"]
    A --> C["Sub-Query 2"]
    A --> D["Sub-Query 3"]
    B --> E["Results 1"]
    C --> F["Results 2"]
    D --> G["Results 3"]
    E & F & G --> H["Merge +<br/>Deduplicate"]
    H --> I["LLM<br/>Generation"]

    style A fill:#9B8EC0,stroke:#1C355E,color:#fff
    style B fill:#00C9A7,stroke:#1C355E,color:#fff
    style C fill:#00C9A7,stroke:#1C355E,color:#fff
    style D fill:#00C9A7,stroke:#1C355E,color:#fff
    style E fill:#FF7A5C,stroke:#1C355E,color:#fff
    style F fill:#FF7A5C,stroke:#1C355E,color:#fff
    style G fill:#FF7A5C,stroke:#1C355E,color:#fff
    style H fill:#9B8EC0,stroke:#1C355E,color:#fff
    style I fill:#1C355E,stroke:#1C355E,color:#fff

When to use: Multi-faceted questions, comparison queries, and “explain with examples” requests.

Query Decomposition Strategies

Strategy How It Works Best For
Parallel Split into independent sub-queries, retrieve simultaneously Comparison questions (“A vs B”)
Sequential Each result informs the next query Chain-of-thought, multi-hop reasoning
Hierarchical Topic → sub-topic → detail Broad questions requiring depth

Decomposition in Action

Before: “Compare chunking strategies and their impact on retrieval accuracy”

After (3 targeted sub-queries):

  1. “What are the different chunking strategies?”
  2. “How does chunk size affect retrieval precision?”
  3. “Which chunking method works best for technical documents?”

Each sub-query targets a specific facet — improving recall without overloading any single query.

Lost in the Middle

The Problem

LLMs perform worse when the relevant answer is in the middle of the context window. They attend best to the beginning and end.

Ordering Strategies:

Strategy How It Works When to Use
Score-descending Best first Default, general purpose
Interleaved Best at start and end Long context windows
Reverse Best at the end Short, focused contexts

Caching

class QueryCache:
    """In-memory cache for repeated query patterns."""
    def __init__(self, max_size=1000):
        self._cache = {}

    def _hash_query(self, query):
        normalized = query.lower().strip()
        return hashlib.md5(normalized.encode()).hexdigest()

    def get(self, query):
        return self._cache.get(self._hash_query(query))

    def set(self, query, results):
        if len(self._cache) >= self._max_size:
            oldest = next(iter(self._cache))
            del self._cache[oldest]
        self._cache[self._hash_query(query)] = results

In production, use Redis or Memcached with TTL-based expiration.

F. Wrap-up

The complete advanced pipeline

Complete Pipeline

flowchart LR
    A["Raw Query"] --> B["Preprocess<br/>& Expand"]
    B --> C["Semantic<br/>Search"]
    B --> D["Keyword<br/>Search"]
    C & D --> E["RRF<br/>Fusion"]
    E --> F["Cross-Encoder<br/>Re-rank"]
    F --> G["Top-K to<br/>LLM"]

    style A fill:#9B8EC0,stroke:#1C355E,color:#fff
    style B fill:#9B8EC0,stroke:#1C355E,color:#fff
    style C fill:#00C9A7,stroke:#1C355E,color:#fff
    style D fill:#FF7A5C,stroke:#1C355E,color:#fff
    style E fill:#9B8EC0,stroke:#1C355E,color:#fff
    style F fill:#FF7A5C,stroke:#1C355E,color:#fff
    style G fill:#1C355E,stroke:#1C355E,color:#fff

Before vs After

Basic Retrieval

  • Single vector search
  • No keyword fallback
  • No quality gate
  • No query understanding
  • Lower recall & precision (typical without tuning)

Advanced Retrieval

  • Hybrid search (BM25 + vectors)
  • RRF fusion of ranked lists
  • Cross-encoder re-ranking
  • Query expansion + caching
  • Measurably higher recall & precision

These comparisons are illustrative. Actual values depend on your corpus, embedding model, and query distribution. Measure using your own golden dataset (Session 5).

Key Takeaways

  1. Simple vector search misses exact keywords and suffers from query mismatch
  2. Hybrid search (BM25 + semantic) catches what either misses alone
  3. RRF fusion elegantly combines ranked lists without tuning weights
  4. Cross-encoder re-ranking is slow but dramatically improves precision
  5. Query preprocessing (expansion, classification) boosts recall significantly
  6. Cache repeated queries to cut latency and cost in production

Up Next

Lab 4: Upgrade the Retriever — implement hybrid search with BM25, add cross-encoder re-ranking, and compare results against the basic pipeline.