Databricks MLOps Playbook: From MLflow to Production

This playbook distills a pragmatic MLOps path on Databricks: from data prep to robust deployment with guardrails.

Why another MLOps guide?

- Focus on operational reality: lineage, reproducibility, cost/latency, and KPI reliability.
- Reusable patterns you can drop into teams without heavy ceremony.

Reference architecture

```mermaid
flowchart TD
  A["Ingest: Batch/Streaming"] --> B["Bronze Delta"]
  B --> C["Curate: Features"]
  C --> D["`ML Training MLflow tracking`"]
  D --> E["`Registry Stages: Staging/Prod`"]
  E --> F["Serving/Batch Scoring"]
  F --> G["`Monitoring Drift, KPI, Cost`"]
```

Building blocks

- Delta Lake: schema evolution, Z-order, OPTIMIZE + VACUUM policies.
- MLflow: experiment tracking, model registry, stage transitions with approvals.
- CI/CD: notebooks/jobs packaged via repo; tests for data contracts and model code.
- Observability: input DQ, feature coverage, drift monitors, KPI windows, cost budgets.

Sample: register and deploy

```python
import mlflow
from mlflow.tracking import MlflowClient

run_id = mlflow.active_run().info.run_id
mlflow.sklearn.log_model(model, "model")

client = MlflowClient()
model_uri = f"runs:/{run_id}/model"
client.create_registered_model("churn_model")
client.create_model_version("churn_model", model_uri, run_id)
client.transition_model_version_stage("churn_model", 1, stage="Staging")
```

Guardrails

- Promotion requires DQ + performance gates; auto-revert on KPI regression.
- Cost envelopes by job cluster policy; latency SLOs per endpoint.

Takeaways

Ship small, measurable increments; automate checks; keep lineage and docs close to the code.
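A promotion gate of the kind described above can be sketched as a simple threshold check. This is a minimal illustration in plain Python; the metric names (`auc`, `dq_pass_rate`) and thresholds are assumptions, not part of any Databricks or MLflow API.

```python
# Illustrative promotion gate: block Staging -> Production unless
# data-quality and performance thresholds are met.
# Metric names and thresholds are hypothetical.

GATES = {"auc_min": 0.80, "dq_pass_rate_min": 0.99}

def passes_gates(metrics: dict) -> bool:
    """Return True only if every gate threshold is satisfied."""
    return (
        metrics.get("auc", 0.0) >= GATES["auc_min"]
        and metrics.get("dq_pass_rate", 0.0) >= GATES["dq_pass_rate_min"]
    )

candidate = {"auc": 0.84, "dq_pass_rate": 0.995}
print(passes_gates(candidate))
```

In practice such a check would run as a CI step before `transition_model_version_stage`, with the auto-revert path triggered by the same predicate evaluated over a post-deployment KPI window.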

2025-09-24 · 1 min · rokorolev

Pre‑sales Patterns for Databricks Solutions Engineers

A compact playbook of motions and artifacts that convert technical credibility into business outcomes.

Core motions

- Discovery → value mapping → success criteria → lightweight PoV plan.
- Hands-on demo/POC: data onboarding, Delta reliability, MLflow, governance.
- Objection handling: performance, cost, migration, security/governance.

Workshop kit (reuse)

- Reference architectures (ingest, Lakehouse, MLOps, RAG) and checklists.
- Demo datasets and notebooks; pre-wired CI/CD and QA gates.
- Post-POC artifacts: runbooks, sizing, adoption roadmap, risk register.

Example PoV success criteria

- Time-to-first Delta ingestion and DQ %.
- MLflow registered model with promotion gates.
- RAG: Recall@k and latency thresholds; data governance and PII controls.

SE hygiene

- Record assumptions, map stakeholders, and publish decisions.
- Measure outcomes; close gaps with targeted enablement.

2025-09-24 · 1 min · rokorolev

RAG on Databricks: Embeddings, Vector Search, and Cost/Latency Tuning

A practical guide to building Retrieval-Augmented Generation on Databricks.

Pipeline overview

```mermaid
flowchart LR
  A[Docs/Transcripts] --> B[Chunk & Clean]
  B --> C["Embed GTE Large EN v1.5"]
  C --> D["Vector Index Vector Search"]
  D --> E["Retrieve"]
  E --> F["Compose Prompt"]
  F --> G["LLM Inference Hybrid"]
  G --> H["Post-process Policy/PII"]
```

Key choices

- Chunking: semantic + overlap; store offsets for citations.
- Embeddings: GTE Large EN v1.5; evaluate coverage vs latency.
- Index: Delta-backed vector search; freshness vs cost trade-offs.
- Inference: hybrid (open + hosted) to balance latency and accuracy.

Example: embed and upsert

```python
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()
index = vsc.get_index("main", "transcripts_idx")
index.upsert([
    {"id": "doc:123#p5", "text": "...", "metadata": {"source": "call"}}
])
```

Evaluation & guardrails

- Offline: Recall@k, response faithfulness, toxicity/policy checks.
- Online: user feedback, fallback/abstain behavior.

Cost/latency tips

- Batch embeddings; cache frequent queries; keep vector dim reasonable.
- Monitor token usage; pre-validate prompts; route by difficulty.
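The offline Recall@k metric mentioned above can be computed with a few lines of plain Python. This is a minimal sketch; the document ids are made-up examples, and a real harness would average this over a labeled query set.

```python
def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved ids."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

# Hypothetical example: 3 relevant docs, 1 of them retrieved in the top 3.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d5"}
print(recall_at_k(retrieved, relevant, k=3))  # 1/3
```

Tracking this per chunking/embedding configuration makes the coverage-vs-latency trade-off above measurable rather than anecdotal.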

2025-09-24 · 1 min · rokorolev

Petroleum Analytics Platform Architecture (2015–2018)

This post captures the practical architecture (2015–2018) that supported the upstream evaluation & stochastic modeling framework outlined in the related post: Upstream Asset Evaluation Framework. It maps legacy design choices to modern terminology and highlights constraints that shaped modeling workflows.

1. Core Principles

- On‑prem / hybrid HPC + Hadoop (YARN) cluster for heavy simulation; limited early cloud (select AWS EC2/EMR/S3; occasional Azure VM/Blob).
- No unified “lakehouse” yet: layered zones → Raw (HDFS/S3) → Curated (Hive/Parquet/ORC) → Marts (Hive/Impala/Presto).
- Limited containers/Kubernetes; batch schedulers dominated (Oozie, early Airflow pilot, Control‑M, Cron).
- Governance largely manual: Hive Metastore + ad hoc catalog (Excel / SharePoint / SQL).

2. Data Ingestion

| Source Type | Examples | Mechanism | Notes |
| --- | --- | --- | --- |
| Geoscience | LAS, SEG-Y | Batch file drop + ETL parse | Large binary + metadata extraction |
| Well / Ops | WITSML feeds | Batch pull / scheduled parse | Standardization step into Hive |
| ERP / Finance | CSV / RDBMS exports | Sqoop (RDBMS→HDFS), SSIS, Python/.NET ETL | Controlled nightly cadence |
| SCADA / Events | Downtime logs | Kafka 0.8/0.9 (where deployed) or Flume/Logstash | Early streaming footprint |
| Market / Pricing | Excel price decks | Staged in SQL then approved to config tables | Manual approval workflow |

Workflow orchestration: Oozie XML workflows early; selective Airflow DAGs (late 2017–2018) for transparency and dependency visualization.

...
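The "metadata extraction" step for LAS well-log files can be sketched as a minimal header parser. This is an illustrative sketch only: production pipelines used dedicated parsers, and the sample mnemonics (`WELL`, `STRT`) are just the common LAS conventions, not the actual feed.

```python
def parse_las_well_section(text: str) -> dict:
    """Extract mnemonic -> value pairs from a LAS ~Well section.
    LAS data lines look like:  MNEM.UNIT   VALUE : DESCRIPTION"""
    meta, in_well = {}, False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("~"):            # section marker, e.g. ~Well, ~Curve
            in_well = line.upper().startswith("~W")
            continue
        if not in_well or line.startswith("#") or ":" not in line:
            continue
        left, _, _desc = line.partition(":")        # drop the description
        mnem_unit, _, value = left.partition(" ")   # split mnemonic from value
        mnem = mnem_unit.split(".")[0].strip()      # strip the unit suffix
        meta[mnem] = value.strip()
    return meta

sample = """~Well
WELL.   EXAMPLE-1 : WELL NAME
STRT.M  1500.0    : START DEPTH
~Curve
DEPT.M            : DEPTH
"""
print(parse_las_well_section(sample))
```

Extracted header metadata like this fed the Hive-side standardization step, while the bulk curve data followed the heavier binary-parsing path.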

2025-09-04 · 5 min · rokorolev

Upstream Asset Evaluation & Stochastic Economic Modeling Framework

This post reconstructs the evaluation, financial modeling, and decision analytics framework used when leading an upstream (oil & gas) analytics team (circa 2015–2018). It blends technical reservoir & production modeling with fiscal, stochastic, real‑options, and portfolio layers plus emerging carbon governance.

1. Checklist (Top-Level Components)

- Scope definition
- Technical (subsurface & production) models
- Commercial & fiscal models
- Market & price modeling
- Cost & economic models
- Real options layer
- Stochastic engine & correlations
- Portfolio aggregation
- Risk & sensitivity
- Carbon / ESG integration
- Data architecture & governance
- Validation & model risk management
- Implementation blueprint

2. Scope & Objectives

- Asset lifecycle: exploration → appraisal → development planning → execution → ramp-up → plateau → decline → abandonment.
- Decisions supported: license bidding, sanction (FID), phasing, drilling sequence, facility sizing, hedging, M&A, divestment, suspension, expansion, abandonment timing.
- Outputs: NPV (pre/post tax), IRR, payback, PI, EMV / ENPV, free cash flow profiles, value at risk (P10/P50/P90), option-adjusted value, carbon-adjusted value, capital efficiency, portfolio efficient frontier.

...
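Two of the headline outputs, NPV and EMV, can be illustrated with a minimal discounted-cash-flow sketch in plain Python. The cash flows, discount rate, and scenario probabilities below are made-up examples, not figures from the framework.

```python
def npv(rate: float, cash_flows: list) -> float:
    """Net present value of period-end cash flows; index 0 is year 0 (undiscounted)."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))

def emv(scenarios: list) -> float:
    """Expected monetary value: probability-weighted outcome values
    (e.g. discrete P10/P50/P90 cases)."""
    return sum(p * v for p, v in scenarios)

# Hypothetical asset: 100 capex up front, then five years of 30 net cash flow.
flows = [-100.0] + [30.0] * 5
base = npv(0.10, flows)

# Probability-weighted value across three discrete outcome scenarios.
expected = emv([(0.3, -20.0), (0.5, base), (0.2, 60.0)])
print(round(base, 2), round(expected, 2))
```

The stochastic engine generalizes this from three discrete scenarios to Monte Carlo draws over correlated price, cost, and production inputs, with P10/P50/P90 read off the resulting NPV distribution.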

2025-09-04 · 6 min · rokorolev