Databricks MLOps Playbook: From MLflow to Production

This playbook distills a pragmatic MLOps path on Databricks: from data prep to robust deployment with guardrails.

Why another MLOps guide?

- Focus on operational reality: lineage, reproducibility, cost/latency, and KPI reliability.
- Reusable patterns you can drop into teams without heavy ceremony.

Reference architecture

```mermaid
flowchart TD
  A["Ingest: Batch/Streaming"] --> B["Bronze Delta"]
  B --> C["Curate: Features"]
  C --> D["`ML Training MLflow tracking`"]
  D --> E["`Registry Stages: Staging/Prod`"]
  E --> F["Serving/Batch Scoring"]
  F --> G["`Monitoring Drift, KPI, Cost`"]
```

Building blocks

- Delta Lake: schema evolution, Z-order, OPTIMIZE + VACUUM policies.
- MLflow: experiment tracking, model registry, stage transitions with approvals.
- CI/CD: notebooks/jobs packaged via repo; tests for data contracts and model code.
- Observability: input DQ, feature coverage, drift monitors, KPI windows, cost budgets.

Sample: register and deploy

```python
import mlflow
from mlflow.tracking import MlflowClient

run_id = mlflow.active_run().info.run_id
mlflow.sklearn.log_model(model, "model")

client = MlflowClient()
model_uri = f"runs:/{run_id}/model"
client.create_registered_model("churn_model")
client.create_model_version("churn_model", model_uri, run_id)
client.transition_model_version_stage("churn_model", "1", stage="Staging")
```

Guardrails

- Promotion requires DQ + performance gates; auto-revert on KPI regression.
- Cost envelopes by job cluster policy; latency SLOs per endpoint.

Takeaways

Ship small, measurable increments; automate checks; keep lineage and docs close to the code.
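The promotion gate described under Guardrails can be sketched as a plain pre-transition check: a candidate model is only promoted when it clears both a data-quality gate and a performance gate against the current production baseline. The metric names and thresholds below are illustrative assumptions, not part of the playbook.

```python
# Minimal promotion-gate sketch. In a real pipeline this check would run
# before any registry stage transition; metric names and thresholds are
# illustrative, not prescribed by the playbook.

def passes_promotion_gate(candidate: dict, baseline: dict,
                          max_auc_drop: float = 0.01,
                          min_dq_pass_rate: float = 0.99) -> bool:
    """True only when the candidate clears both the DQ and performance gates."""
    dq_ok = candidate["dq_pass_rate"] >= min_dq_pass_rate
    perf_ok = candidate["auc"] >= baseline["auc"] - max_auc_drop
    return dq_ok and perf_ok

candidate = {"auc": 0.87, "dq_pass_rate": 0.995}
baseline = {"auc": 0.86}
print(passes_promotion_gate(candidate, baseline))  # True: both gates pass
```

Wiring this into CI keeps the gate versioned next to the model code, and the same function can drive the auto-revert path when a KPI window regresses.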

2025-09-24 · 1 min · rokorolev

RAG on Databricks: Embeddings, Vector Search, and Cost/Latency Tuning

A practical guide to building Retrieval-Augmented Generation on Databricks.

Pipeline overview

```mermaid
flowchart LR
  A[Docs/Transcripts] --> B[Chunk & Clean]
  B --> C["Embed GTE Large EN v1.5"]
  C --> D["Vector Index Vector Search"]
  D --> E["Retrieve"]
  E --> F["Compose Prompt"]
  F --> G["LLM Inference Hybrid"]
  G --> H["Post-process Policy/PII"]
```

Key choices

- Chunking: semantic + overlap; store offsets for citations.
- Embeddings: GTE Large EN v1.5; evaluate coverage vs latency.
- Index: Delta-backed vector search; freshness vs cost trade-offs.
- Inference: hybrid (open + hosted) to balance latency and accuracy.

Example: embed and upsert

```python
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()
index = vsc.get_index("main", "transcripts_idx")
index.upsert([
    {"id": "doc:123#p5", "text": "...", "metadata": {"source": "call"}}
])
```

Evaluation & guardrails

- Offline: Recall@k, response faithfulness, toxicity/policy checks.
- Online: user feedback, fallback/abstain behavior.

Cost/latency tips

- Batch embeddings; cache frequent queries; keep vector dim reasonable.
- Monitor token usage; pre-validate prompts; route by difficulty.
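The chunking choice above (overlap plus stored offsets for citations) can be sketched with a character-based splitter; a production pipeline would split on semantic boundaries instead, and the chunk sizes here are illustrative assumptions.

```python
# Character-based chunking with overlap. Each chunk keeps its (start, end)
# offsets in the source document so generated answers can cite exact spans.

def chunk_with_offsets(text: str, size: int = 500, overlap: int = 100):
    chunks, start = [], 0
    step = size - overlap
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append({"text": text[start:end], "start": start, "end": end})
        if end == len(text):
            break
        start += step
    return chunks

doc = "x" * 1200
chunks = chunk_with_offsets(doc, size=500, overlap=100)
print([(c["start"], c["end"]) for c in chunks])
# [(0, 500), (400, 900), (800, 1200)]
```

Because offsets are preserved, a citation like `doc:123#p5` can be resolved back to the exact character span that supported the answer.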

2025-09-24 · 1 min · rokorolev

Privacy-Aware Pipelines: PII Detection, Redaction, and Governance at Scale

This post shares a practical architecture for privacy-aware data processing on Spark/Delta with PII discovery, redaction, and auditability. It reflects patterns used to replace external DLP APIs with in‑house pipelines.

1) Objectives

- Discover PII across semi/unstructured data (text/JSON).
- Redact or tokenize with policy-driven transformations.
- Preserve utility for analytics and ensure auditability.

2) Architecture Overview

```mermaid
flowchart TB
  A[Landing / Bronze] --> B{PII Scanner}
  B -->|classified| C[Redaction / Tokenization]
  C --> D[Delta Curated]
  B -->|no PII| D
  D --> E[Access Zones / Views]
  E --> F[Analytics / BI]
  D --> G[Audit Tables]
```

3) PII Detection (Presidio + Spark NLP)

- Use Presidio analyzers for entities (EMAIL, PHONE, CREDIT_CARD, PERSON, etc.).
- Complement with domain regex and Spark NLP for names/locations if needed.
- Confidence thresholds and context words to reduce false positives.

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Contact John at john.doe@example.com or +1-555-123-4567"
results = analyzer.analyze(text=text, language="en")
redacted = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={"DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"})},
).text
print(redacted)
```

4) Spark Integration Pattern

- UDF wrapper calling Presidio per row (for small/medium texts); for large docs, batch per partition.
- Structured outputs: keep the original column, plus redacted_text and pii_entities (array of structs).
- Deterministic tokenization for referential integrity: hash with salt for IDs.
```python
from pyspark.sql import functions as F, types as T

def redact_text(text: str) -> str:
    # call analyzer/anonymizer; return redacted text
    return text  # placeholder

redact_udf = F.udf(redact_text, T.StringType())

df_out = (
    df_in
    .withColumn("redacted_text", redact_udf(F.col("text")))
)
```

5) Governance & Audit

- Write detections to pii_detection_log with columns: id, entity_type, start, end, confidence, doc_id, ds.
- Track policy decisions (mask/tokenize/pass) and version the policy set.
- Access via views: analysts see redacted columns by default; elevated roles can request reveal with approvals.

6) Quality & Monitoring

- Metrics: percent of docs with PII, entity counts by type, redaction coverage.
- Drift detection on entity distributions to catch model/pattern degradation.
- Sampling UI for manual review (Streamlit or a simple web app).

7) Operational Considerations

- Throughput: consider broadcasted dictionaries/context words; avoid heavy Python UDFs where possible (Scala UDFs or native libs are faster).
- Cost: cache pre-processing, process incrementally, skip re-redaction on unchanged docs (idempotence).
- Security: store salts/keys in secret scopes; lock down raw zones.

8) Minimal Policy Example

```yaml
version: 0.3.0
entities:
  - type: EMAIL
    action: redact
  - type: PHONE_NUMBER
    action: redact
  - type: PERSON
    action: tokenize
    token_salt_secret: keyvault://pii-tokens/person-salt
```

9) Rollout Plan

- Pilot on one high‑value dataset (support tickets or transcripts).
- Add governance hooks (policy version table, audit writes, views).
- Expand coverage by domain; tune thresholds; add sampling UI.

Pragmatic takeaway: privacy-aware pipelines protect users and your org while keeping data useful. Bake in policy, audit, and performance from day one.
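The deterministic tokenization called out in the Spark integration pattern (salted hashing so the same ID always maps to the same token, preserving joins) can be sketched with the standard library. The salt would come from a secret scope in practice, and the `tok_` token format is an assumption for illustration.

```python
import hashlib
import hmac

def tokenize_id(value: str, salt: bytes) -> str:
    """Deterministically tokenize an identifier: the same input + salt always
    yields the same token, so referential integrity survives redaction."""
    digest = hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"tok_{digest[:16]}"  # truncated for readability; length is a choice

salt = b"from-secret-scope"  # in practice, read from a Databricks secret scope
t1 = tokenize_id("john.doe@example.com", salt)
t2 = tokenize_id("john.doe@example.com", salt)
print(t1 == t2)  # True: joins across tables still line up
```

Using HMAC rather than a bare hash ties tokens to the secret salt, so tokens cannot be reproduced (or reversed via dictionary attack) without access to the secret scope.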

2025-09-16 · 3 min · rokorolev

CargoIMP Spark Parser

cargoimp-spark-parser is a Scala library that extends your Apache Spark jobs with native high-performance parsing for IATA Cargo-IMP (International Air Transport Association Cargo Interchange Message Procedures) messages. It empowers Spark users in air cargo, logistics, and data engineering domains by making Cargo-IMP message types—such as FHL and FWB—directly accessible, explorable, and analyzable within Spark DataFrames and SQL queries. Project Link: https://rokorolev.gitlab.io/cargoimp-spark-parser/

2025-09-04 · 1 min · rokorolev

Fantastic Spork

In real-world analytics, Spark users often need to do things like count substrings, tally words in collections, or process text—tasks not always convenient with Spark’s built-in SQL functions. fantastic-spork delivers production-ready, native Catalyst expressions for these cases, ensuring top Spark performance and seamless integration.

- More efficient than regular Scala UDFs
- Convenient SQL extensions
- Composable for DataFrame, Dataset, and SQL APIs

Project Link: https://rokorolev.gitlab.io/fantastic-spork/

2025-09-04 · 1 min · rokorolev

Petroleum Analytics Platform Architecture (2015–2018)

This post captures the practical architecture (2015–2018) that supported the upstream evaluation & stochastic modeling framework outlined in the related post: Upstream Asset Evaluation Framework. It maps legacy design choices to modern terminology and highlights constraints that shaped modeling workflows.

1. Core Principles

- On‑prem / hybrid HPC + Hadoop (YARN) cluster for heavy simulation; limited early cloud (select AWS EC2/EMR/S3; occasional Azure VM/Blob).
- No unified “lakehouse” yet: layered zones → Raw (HDFS/S3) → Curated (Hive/Parquet/ORC) → Marts (Hive/Impala/Presto).
- Limited containers/Kubernetes; batch schedulers dominated (Oozie, early Airflow pilot, Control‑M, Cron).
- Governance largely manual: Hive Metastore + ad hoc catalog (Excel / SharePoint / SQL).

2. Data Ingestion

| Source Type | Examples | Mechanism | Notes |
|---|---|---|---|
| Geoscience | LAS, SEG-Y | Batch file drop + ETL parse | Large binary + metadata extraction |
| Well / Ops | WITSML feeds | Batch pull / scheduled parse | Standardization step into Hive |
| ERP / Finance | CSV / RDBMS exports | Sqoop (RDBMS→HDFS), SSIS, Python/.NET ETL | Controlled nightly cadence |
| SCADA / Events | Downtime logs | Kafka 0.8/0.9 (where deployed) or Flume/Logstash | Early streaming footprint |
| Market / Pricing | Excel price decks | Staged in SQL, then approved to config tables | Manual approval workflow |

Workflow orchestration: Oozie XML workflows early; selective Airflow DAGs (late 2017–2018) for transparency and dependency visualization.

...

2025-09-04 · 5 min · rokorolev

RedactifyAI

RedactifyAI is a Python package for detecting and anonymizing sensitive Personally Identifiable Information (PII) in textual data using Microsoft’s Presidio and Apache Spark.

Key Features

- Integration with Presidio to detect and anonymize PII such as names, emails, phone numbers, and more.
- Spark-powered processing for scalable anonymization using PySpark.
- Custom recognizers to extend PII detection for specific needs.

Project Link: https://rokorolev.gitlab.io/redactify-ai/

2025-09-04 · 1 min · rokorolev

Sparqlin

sparqlin is a Spark SQL framework designed to simplify job creation and management in Databricks environments. It integrates with Spark SQL and PySpark for a streamlined development experience. The framework was created specifically to empower data analysts who may not have deep development skills: it offers a lightweight path to standard software development life cycles, letting analysts focus on the data rather than on complex programming paradigms. By leveraging familiar tools such as SQL scripts and YAML files, it simplifies data configuration, transformation, and testing. ...

2025-09-04 · 1 min · rokorolev

CarrierPerformanceReportsEtl

CarrierPerformanceReportsEtl is a production Spark/Scala data platform I architected and grew over ~4 years at WTG to ingest, evolve, and serve carrier & logistics performance analytics. I founded it as a solo engineer, then mentored rotating contributors while owning the roadmap, standards, and release quality (acting de‑facto team & tech lead while titled Data Scientist / Senior Data Scientist). 1. Problem & Context Logistics operations required timely, reliable KPIs (turnaround, message latency, carrier performance) sourced from heterogeneous semi‑structured message streams and relational systems. Challenges: ...

2023-11-01 · 10 min · rokorolev