This post outlines a pragmatic roadmap to evolve data contracts in an analytics platform. It favors incremental adoption over big‑bang rewrites and ties directly to operational needs: stability, speed, and safety.
## 1) Why contracts (and why now)

- Control schema drift and reduce breakage.
- Enable fast feedback (pre‑prod contract tests) and safer evolution.
- Improve ownership clarity and auditability.

## 2) Contract primitives

- Interface = table/view/topic + schema + semantics + SLA.
- Versioned schema in source control (YAML/JSON) with description and constraints.

Example: minimal YAML for a dataset contract (a sketch of evaluating the `slas` block appears at the end of this post):

```yaml
name: carrier_performance
version: 1.2.0
owner: analytics.eng@acme.example
schema:
  - name: carrier_id
    type: string
    constraints: [not_null]
  - name: ds
    type: date
    constraints: [not_null]
  - name: on_time_pct
    type: decimal(5,2)
    constraints: [range: {min: 0, max: 100}]
  - name: loads
    type: int
    constraints: [not_null, gte: 0]
slas:
  freshness: {max_lag_hours: 6}
  completeness: {min_rows: 1000}
```

## 3) Evolution rules

- SemVer: MAJOR for breaking, MINOR for additive, PATCH for docs/constraints.
- Deprecation windows and dual‑publishing for breaking changes.
- Compatibility gate in CI (producer and consumer tests); a minimal classifier is sketched at the end of this post.

```bash
# Pseudo‑CI checks
contract check producer --path contracts/carrier_performance.yaml --against current-schema.json
contract check consumers --dataset carrier_performance --min-version 1.1.0
```

## 4) Validation patterns (Spark/Delta)

- Type/shape validation in ETL; catalog registration only on pass.
- Constraint checks materialized to a dq_issue table.

```python
# PySpark example: enforce contract types and simple constraints
from pyspark.sql import functions as F, types as T

# Target schema from the contract; nullable flags mirror not_null constraints.
EXPECTED_SCHEMA = T.StructType([
    T.StructField("carrier_id", T.StringType(), False),
    T.StructField("ds", T.DateType(), False),
    T.StructField("on_time_pct", T.DecimalType(5, 2), True),
    T.StructField("loads", T.IntegerType(), False),
])

def validate(df):
    # Cast incoming columns to the contract types; extra columns are dropped.
    casted = df.select(
        F.col("carrier_id").cast("string").alias("carrier_id"),
        F.to_date("ds").alias("ds"),
        F.col("on_time_pct").cast("decimal(5,2)").alias("on_time_pct"),
        F.col("loads").cast("int").alias("loads"),
    )
    # Types must now match the contract; nullability is covered by the dq
    # metrics below, since cast() always yields nullable columns.
    assert [f.dataType for f in casted.schema] == [f.dataType for f in EXPECTED_SCHEMA]
    # One-row metrics of constraint violations, destined for the dq_issue table.
    dq_df = casted.select(
        F.count(F.when(F.col("carrier_id").isNull(), 1)).alias("carrier_id_null"),
        F.count(F.when(F.col("ds").isNull(), 1)).alias("ds_null"),
    )
    return casted, dq_df
```

## 5) Contract tests for consumers

Consumers pin to a minimum compatible version and verify invariants. E.g., a downstream view expects on_time_pct to be present and within 0–100.

```sql
-- dbt / SQL test snippet
select count(*) as violations
from {{ ref('carrier_performance') }}
where on_time_pct is null
   or on_time_pct < 0
   or on_time_pct > 100
```

## 6) Governance & catalog

- Register contracts in a catalog (DataHub/OpenMetadata) with lineage and ownership.
- Emit events on contract version changes and SLA breaches (an example payload is sketched at the end of this post).

## 7) Rollout plan

- Phase 1: Document schemas for 3 high‑impact datasets; add read‑only validation.
- Phase 2: Enforce contracts in CI for producers; add consumer tests.
- Phase 3: Introduce SemVer and deprecation workflows; gain visibility via the catalog.
- Phase 4: Automate change review and impact analysis with lineage.

## 8) Reference diagram

```mermaid
flowchart TB
  A[Producer ETL] --> B[(Contract YAML)]
  B --> C{CI checks}
  C -->|pass| D[Publish to Curated]
  C -->|fail| E[Block & Report]
  D --> F[Consumers]
  F --> G[Contract Tests]
  D --> H[Catalog / Lineage]
```

Pragmatic takeaway: contracts lower entropy and raise change velocity. Start small, enforce where it matters, and expand coverage with clear ownership.
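To make a few of the steps above concrete, here are three small sketches. First, the `slas` block from section 2: a minimal check, assuming the scheduler or catalog can supply the table's last load time and row count. The `check_slas` helper and its inputs are illustrative, not a real API.

```python
# Sketch: evaluate a contract's slas block against observed table stats.
# last_loaded_at and row_count are assumed inputs from your scheduler or
# catalog metrics; the helper itself is hypothetical.
from datetime import datetime, timedelta, timezone

def check_slas(last_loaded_at: datetime, row_count: int,
               max_lag_hours: int = 6, min_rows: int = 1000) -> list[str]:
    breaches = []
    if datetime.now(timezone.utc) - last_loaded_at > timedelta(hours=max_lag_hours):
        breaches.append("freshness")
    if row_count < min_rows:
        breaches.append("completeness")
    return breaches  # emit an SLA-breach event per entry (see section 6)
```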
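Next, the compatibility gate from section 3: a minimal sketch of classifying a schema diff under SemVer. The `Field` shape and `classify_change` are illustrative stand-ins for whatever the real `contract check` tooling does.

```python
# Sketch: classify a schema diff as MAJOR / MINOR / PATCH (illustrative only).
from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    name: str
    type: str
    not_null: bool = False

def classify_change(old: list[Field], new: list[Field]) -> str:
    new_by_name = {f.name: f for f in new}
    # Removing a field, changing its type, or tightening nullability is breaking.
    for f in old:
        g = new_by_name.get(f.name)
        if g is None or g.type != f.type or (g.not_null and not f.not_null):
            return "MAJOR"
    # Every old field survived unchanged, so any extra field is additive.
    if len(new) > len(old):
        return "MINOR"
    return "PATCH"

# Adding a nullable column is a MINOR change:
v1 = [Field("carrier_id", "string", True), Field("loads", "int", True)]
assert classify_change(v1, v1 + [Field("lane_id", "string")]) == "MINOR"
```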
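Finally, the change events from section 6: a sketch of the payload a producer might emit on a version bump. The event name and transport are placeholders for whatever bus or catalog webhook the platform uses.

```python
# Sketch: build a contract-change event payload (event name is a placeholder;
# send it via your event bus or catalog webhook of choice).
import json
from datetime import datetime, timezone

def contract_change_event(dataset: str, old_version: str,
                          new_version: str, change_type: str) -> str:
    return json.dumps({
        "event": "contract.version_changed",
        "dataset": dataset,
        "old_version": old_version,
        "new_version": new_version,
        "change_type": change_type,  # MAJOR / MINOR / PATCH from the CI gate
        "emitted_at": datetime.now(timezone.utc).isoformat(),
    })

print(contract_change_event("carrier_performance", "1.1.0", "1.2.0", "MINOR"))
```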