# RAG on Databricks: Embeddings, Vector Search, and Cost/Latency Tuning
A practical guide to building Retrieval-Augmented Generation (RAG) on Databricks.

## Pipeline overview

```mermaid
flowchart LR
    A[Docs/Transcripts] --> B["Chunk & Clean"]
    B --> C["Embed<br/>GTE Large EN v1.5"]
    C --> D["Vector Index<br/>Vector Search"]
    D --> E[Retrieve]
    E --> F[Compose Prompt]
    F --> G["LLM Inference<br/>Hybrid"]
    G --> H["Post-process<br/>Policy/PII"]
```

## Key choices

- **Chunking:** semantic chunking with overlap; store character offsets so answers can cite exact source spans (sketched below).
- **Embeddings:** GTE Large EN v1.5; evaluate retrieval coverage against embedding latency.
- **Index:** Delta-backed Vector Search; weigh index freshness against sync cost.
- **Inference:** hybrid routing across open-weight and hosted models to balance latency and accuracy.

## Example: embed and upsert

```python
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()
index = vsc.get_index(endpoint_name="main", index_name="transcripts_idx")

# Rows must match the index schema; the id encodes document and paragraph
# so retrieved chunks can be cited.
index.upsert([
    {"id": "doc:123#p5", "text": "...", "source": "call"}
])
```

## Evaluation & guardrails

- Offline: Recall@k, response faithfulness, toxicity/policy checks (a Recall@k sketch follows below).
- Online: user feedback, fallback/abstain behavior.

## Cost/latency tips

- Batch embeddings, cache frequent queries, and keep the vector dimension reasonable (both sketched below).
- Monitor token usage, pre-validate prompts, and route requests by difficulty.
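To ground the chunking bullet above, here is a minimal sketch of overlapped chunking that records character offsets for citations. It uses fixed-size windows rather than true semantic splitting, and the size and overlap values are illustrative.

```python
def chunk_with_offsets(text: str, size: int = 800, overlap: int = 120) -> list[dict]:
    """Split text into overlapping chunks, keeping (start, end) character offsets."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        # Offsets map a retrieved chunk back to its exact span in the source doc.
        chunks.append({"text": text[start:end], "start": start, "end": end})
        if end == len(text):
            break
        start = end - overlap  # step back so boundary sentences land in both chunks
    return chunks
```

Storing these offsets alongside ids like `doc:123#p5` lets the post-processing stage quote the exact span it cites.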
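For the embed step, one way to batch requests on Databricks is the MLflow Deployments client against an embeddings serving endpoint. The endpoint name `databricks-gte-large-en` is an assumption here; substitute whatever endpoint serves GTE Large EN v1.5 in your workspace.

```python
import mlflow.deployments

client = mlflow.deployments.get_deploy_client("databricks")

def embed_batch(texts: list[str]) -> list[list[float]]:
    # Sending a list per request amortizes per-call overhead (the "batch
    # embeddings" tip). The endpoint name is an assumption for this sketch.
    response = client.predict(
        endpoint="databricks-gte-large-en",
        inputs={"input": texts},
    )
    return [row["embedding"] for row in response["data"]]
```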
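At query time, the Retrieve and Compose Prompt stages can stay small. The sketch below assumes a Delta-synced index with a managed embedding model, so `similarity_search` accepts raw `query_text`; the prompt template and column names are illustrative.

```python
def retrieve_and_compose(index, question: str, k: int = 5) -> str:
    # "id" and "text" match the columns used in the upsert example above.
    hits = index.similarity_search(
        query_text=question,
        columns=["id", "text"],
        num_results=k,
    )
    rows = hits["result"]["data_array"]  # each row: [id, text, score]
    # Inline chunk ids let post-processing attach citations to the answer.
    context = "\n\n".join(f"[{row[0]}] {row[1]}" for row in rows)
    return (
        "Answer using only the context below and cite chunk ids in brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```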
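On the evaluation side, Recall@k needs only labeled (query, relevant-chunk-ids) pairs; the helper below is a generic sketch, not a Databricks API.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of labeled relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

# Example: one of two relevant chunks retrieved in the top 2 -> 0.5
assert recall_at_k(["doc:123#p5", "doc:9#p1"], {"doc:123#p5", "doc:4#p2"}, k=2) == 0.5
```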
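Finally, the query-caching tip can start as an in-process LRU cache keyed on normalized query text, reusing `retrieve_and_compose` and `index` from the retrieval sketch above; a shared store would replace this in multi-node serving, and the cache size is illustrative.

```python
from functools import lru_cache

@lru_cache(maxsize=2048)
def _cached_prompt(normalized_question: str) -> str:
    # index and retrieve_and_compose come from the retrieval sketch above.
    return retrieve_and_compose(index, normalized_question)

def cached_retrieval(question: str) -> str:
    # Normalize before caching so near-duplicate queries share one entry.
    return _cached_prompt(question.strip().lower())
```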