A practical guide to building Retrieval-Augmented Generation (RAG) pipelines on Databricks.

Pipeline overview

flowchart LR
  A[Docs/Transcripts] --> B[Chunk & Clean]
  B --> C["Embed
          GTE Large EN v1.5"]
  C --> D["Vector Index
          Vector Search"]
  D --> E["Retrieve"]
  E --> F["Compose Prompt"]
  F --> G["LLM Inference
          Hybrid"]
  G --> H["Post-process
           Policy/PII"]

Key choices

  • Chunking: split on semantic boundaries with overlap; store character offsets so answers can cite their sources (see the sketch after this list).
  • Embeddings: GTE Large EN v1.5; evaluate retrieval coverage against embedding latency.
  • Index: Delta-backed Vector Search; weigh index freshness against sync cost.
  • Inference: hybrid routing across open and hosted models to balance latency and accuracy.
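
To make the chunking bullet concrete, here is a minimal fixed-size stand-in for a semantic splitter: overlapping windows whose character offsets are stored so retrieved passages can be cited back to the source document. The window sizes are illustrative, not tuned values.

def chunk_with_offsets(text: str, size: int = 800, overlap: int = 100):
    """Split text into overlapping chunks, keeping (start, end)
    character offsets for citations."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append({"text": text[start:end], "start": start, "end": end})
        if end == len(text):
            break
        start = end - overlap  # overlap preserves context across boundaries
    return chunks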

Example: embed and upsert

A sketch against a direct-access index; the endpoint and index names here are assumptions.

from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()
# Full three-level index name plus its serving endpoint (both assumed).
index = vsc.get_index(
    endpoint_name="vs_endpoint",
    index_name="main.rag.transcripts_idx",
)
# Records must match the index schema; with self-managed embeddings,
# include the vector alongside the text and metadata columns.
index.upsert([
    {"id": "doc:123#p5", "text": "...", "source": "call",
     "text_vector": [...]},  # embedding from GTE Large EN v1.5
])
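
For the "embed" half, one common pattern is to call the pay-per-token GTE endpoint through the MLflow deployments client, batching texts per request. A sketch, assuming the databricks-gte-large-en serving endpoint is available in the workspace:

import mlflow.deployments

client = mlflow.deployments.get_deploy_client("databricks")

# Batch texts into one request to amortize per-call overhead.
resp = client.predict(
    endpoint="databricks-gte-large-en",
    inputs={"input": ["chunk one ...", "chunk two ..."]},
)
# Response follows the OpenAI-style embeddings schema.
vectors = [item["embedding"] for item in resp["data"]]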

Evaluation & guardrails

  • Offline: Recall@k on a labeled retrieval set (a minimal computation follows this list), response faithfulness to the retrieved context, and toxicity/policy checks.
  • Online: user feedback signals, plus fallback/abstain behavior when retrieval confidence is low.
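
As a starting point for the offline checks, Recall@k can be computed directly from labeled (query, relevant ids) pairs. A minimal sketch; the data structures (list of retrieved ids, set of relevant ids per query) are illustrative:

def recall_at_k(retrieved: dict, relevant: dict, k: int = 5) -> float:
    """Mean over queries of |relevant ∩ top-k retrieved| / |relevant|."""
    scores = []
    for query, rel_ids in relevant.items():
        top_k = set(retrieved[query][:k])
        scores.append(len(rel_ids & top_k) / len(rel_ids))
    return sum(scores) / len(scores)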

Cost/latency tips

  • Batch embedding requests; cache retrieval results for frequent queries (sketched below); keep vector dimensionality modest.
  • Monitor token usage per request; validate prompt length and content before inference; route easy queries to cheaper models.
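
A minimal sketch of the query-cache tip, assuming exact-match repeats are common enough to memoize (near-duplicate or semantic caching is a further step). It reuses the index handle from the retrieval example above.

from functools import lru_cache

def retrieve(query: str, k: int = 5):
    # Normalize first so trivially different strings share a cache slot.
    return _cached_retrieve(query.strip().lower(), k)

@lru_cache(maxsize=2048)
def _cached_retrieve(query: str, k: int):
    hits = index.similarity_search(
        query_text=query,
        columns=["id", "text"],
        num_results=k,
    )
    # Tuples make the cached value hashable and immutable.
    return tuple(map(tuple, hits["result"]["data_array"]))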