A practical guide to building Retrieval-Augmented Generation (RAG) pipelines on Databricks.
Pipeline overview
```mermaid
flowchart LR
    A[Docs/Transcripts] --> B[Chunk & Clean]
    B --> C["Embed<br/>GTE Large EN v1.5"]
    C --> D["Vector Index<br/>Vector Search"]
    D --> E[Retrieve]
    E --> F[Compose Prompt]
    F --> G["LLM Inference<br/>Hybrid"]
    G --> H["Post-process<br/>Policy/PII"]
```
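The Retrieve and Compose Prompt stages reduce to a similarity search plus a prompt template. A minimal sketch, assuming a Vector Search endpoint named vs_endpoint, an index main.rag.transcripts_idx with id and text columns, and Databricks-managed embeddings:

```python
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()
index = vsc.get_index(endpoint_name="vs_endpoint", index_name="main.rag.transcripts_idx")

def compose_prompt(question: str, k: int = 5) -> str:
    # query_text works for managed-embeddings indexes; pass query_vector
    # instead if embeddings are self-managed.
    results = index.similarity_search(
        query_text=question, columns=["id", "text"], num_results=k
    )
    rows = results["result"]["data_array"]  # each row follows the `columns` order
    context = "\n\n".join(f"[{row[0]}] {row[1]}" for row in rows)
    return (
        "Answer using only the context below; cite chunk ids in brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```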
Key choices
- Chunking: semantic splitting with overlap; store character offsets so answers can cite exact passages (a minimal sketch follows this list).
- Embeddings: GTE Large EN v1.5; evaluate retrieval coverage against embedding latency and cost before committing.
- Index: Delta-table-backed Vector Search; weigh index freshness (sync frequency) against compute cost.
- Inference: hybrid routing across open-weights and hosted models to balance latency and accuracy.
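The chunking sketch referenced above, with fixed-size windows standing in for a full semantic splitter; the window and overlap sizes are placeholders, and size must exceed overlap for the loop to terminate:

```python
def chunk_with_offsets(text: str, size: int = 800, overlap: int = 120):
    """Split text into overlapping windows, keeping character offsets for citations."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append({"text": text[start:end], "start": start, "end": end})
        if end == len(text):
            break
        start = end - overlap  # step back so adjacent chunks share context
    return chunks
```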
Example: embed and upsert
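The embedding call first; a sketch using the MLflow deployments client, assuming the pay-per-token databricks-gte-large-en endpoint is available in the workspace:

```python
import mlflow.deployments

client = mlflow.deployments.get_deploy_client("databricks")

def embed(texts: list[str]) -> list[list[float]]:
    # Embedding endpoints take a batch of strings and return one vector per input.
    response = client.predict(
        endpoint="databricks-gte-large-en", inputs={"input": texts}
    )
    return [item["embedding"] for item in response["data"]]
```

With vectors in hand, the upsert below writes rows into a Direct Vector Access index; the text_vector column name is an assumption about the index schema.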
```python
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()
# Endpoint and three-level Unity Catalog index names are illustrative.
index = vsc.get_index(endpoint_name="vs_endpoint", index_name="main.rag.transcripts_idx")
index.upsert([
    # Direct Vector Access indexes store self-managed vectors alongside each row.
    {"id": "doc:123#p5", "text": "...", "text_vector": embed(["..."])[0],
     "metadata": {"source": "call"}}
])
```
Evaluation & guardrails
- Offline: Recall@k on labeled query/chunk pairs (see the sketch below), response faithfulness to the retrieved context, and toxicity/policy checks.
- Online: user feedback signals and the rate of fallback/abstain responses.
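Recall@k is straightforward to compute once you have labeled (query, relevant chunk ids) pairs; a minimal sketch:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of relevant chunks that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# One of the two relevant chunks appears in the top 5 -> 0.5
print(recall_at_k(["a", "b", "c", "d", "e"], {"a", "z"}, k=5))
```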
Cost/latency tips
- Batch embedding requests at ingest; cache embeddings for frequent queries (see the sketch below); keep vector dimensionality reasonable.
- Monitor token usage; pre-validate prompts against length and policy limits; route queries to models by difficulty.
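Query-embedding caching can be as small as an lru_cache around the embedding call; a sketch reusing the hypothetical embed() helper from the embedding example, with a placeholder cache size:

```python
from functools import lru_cache

@lru_cache(maxsize=4096)
def embed_query_cached(query: str) -> tuple[float, ...]:
    # lru_cache requires hashable return values, so store a tuple, not a list.
    return tuple(embed([query])[0])
```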