Data Contract Evolution: From Ad Hoc Schemas to Governed Interfaces

This post outlines a pragmatic roadmap to evolve data contracts in an analytics platform. It favors incremental adoption over big‑bang rewrites and ties directly to operational needs: stability, speed, and safety.

1) Why contracts (and why now)

- Control schema drift and reduce breakage.
- Enable fast feedback (pre‑prod contract tests) and safer evolution.
- Improve ownership clarity and auditability.

2) Contract primitives

- Interface = table/view/topic + schema + semantics + SLA.
- Versioned schema in source control (YAML/JSON) with description and constraints.

Example: minimal YAML for a dataset contract.

```yaml
name: carrier_performance
version: 1.2.0
owner: analytics.eng@acme.example
schema:
  - name: carrier_id
    type: string
    constraints: [not_null]
  - name: ds
    type: date
    constraints: [not_null]
  - name: on_time_pct
    type: decimal(5,2)
    constraints: [range: {min: 0, max: 100}]
  - name: loads
    type: int
    constraints: [not_null, gte: 0]
slas:
  freshness: {max_lag_hours: 6}
  completeness: {min_rows: 1000}
```

3) Evolution rules

- SemVer: MAJOR for breaking, MINOR for additive, PATCH for docs/constraints.
- Deprecation windows and dual‑publishing for breaking changes.
- Compatibility gate in CI (producer and consumer tests); a sketch of such a check appears at the end of this post.

```bash
# Pseudo‑CI checks
contract check producer --path contracts/carrier_performance.yaml --against current-schema.json
contract check consumers --dataset carrier_performance --min-version 1.1.0
```

4) Validation patterns (Spark/Delta)

- Type/shape validation in ETL; catalog registration only on pass.
- Constraint checks materialized to a dq_issue table.

```python
# PySpark example: enforce contract types and simple constraints
from pyspark.sql import functions as F, types as T

def validate(df):
    # Contract schema as declared in the YAML above
    expected = T.StructType([
        T.StructField("carrier_id", T.StringType(), False),
        T.StructField("ds", T.DateType(), False),
        T.StructField("on_time_pct", T.DecimalType(5, 2), True),
        T.StructField("loads", T.IntegerType(), False),
    ])
    # Cast incoming data to the contract types
    casted = df.select(
        F.col("carrier_id").cast("string").alias("carrier_id"),
        F.to_date("ds").alias("ds"),
        F.col("on_time_pct").cast("decimal(5,2)").alias("on_time_pct"),
        F.col("loads").cast("int").alias("loads"),
    )
    # Guard: column names/types of the casted frame must match the contract
    actual = [(f.name, f.dataType) for f in casted.schema]
    declared = [(f.name, f.dataType) for f in expected.fields]
    assert actual == declared, f"schema drift: {actual} != {declared}"
    # Null-constraint counts, to be materialized into a dq_issue table
    dq = [
        F.count(F.when(F.col("carrier_id").isNull(), 1)).alias("carrier_id_null"),
        F.count(F.when(F.col("ds").isNull(), 1)).alias("ds_null"),
    ]
    dq_df = casted.select(*dq)
    return casted, dq_df
```

5) Contract tests for consumers

- Consumers pin to a minimum compatible version and verify invariants.
- E.g., a downstream view expects on_time_pct to be present and within 0–100.

```sql
-- dbt / SQL test snippet
select count(*) as violations
from {{ ref('carrier_performance') }}
where on_time_pct is null
   or on_time_pct < 0
   or on_time_pct > 100
```

6) Governance & catalog

- Register contracts in a catalog (DataHub/OpenMetadata) with lineage and ownership.
- Emit events on contract version changes and SLA breaches.

7) Rollout plan

- Phase 1: Document schemas for 3 high‑impact datasets; add read‑only validation.
- Phase 2: Enforce contracts in CI for producers; add consumer tests.
- Phase 3: Introduce SemVer and deprecation workflows; visibility via catalog.
- Phase 4: Automate change review and impact analysis with lineage.

8) Reference diagram

```mermaid
flowchart TB
  A[Producer ETL] --> B[(Contract YAML)]
  B --> C{CI checks}
  C -->|pass| D[Publish to Curated]
  C -->|fail| E[Block & Report]
  D --> F[Consumers]
  F --> G[Contract Tests]
  D --> H[Catalog / Lineage]
```

Pragmatic takeaway: contracts lower entropy and raise change velocity. Start small, enforce where it matters, and expand coverage with clear ownership.
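To make the CI compatibility gate from section 3 concrete, here is a minimal sketch of a producer-side check that classifies a contract change as MAJOR, MINOR, or PATCH. It is not the interface of any specific contract tool: the file paths, the `load_fields` helper, and the simplified rules (any removed or retyped column is breaking, any added column is additive) are assumptions for illustration, and it assumes PyYAML is available.

```python
# Illustrative only: classify a contract change as MAJOR / MINOR / PATCH.
# Assumes contracts are stored as YAML files shaped like the example above.
import yaml  # PyYAML

def load_fields(path: str) -> dict:
    """Return {column_name: type} from a contract YAML file."""
    with open(path) as f:
        contract = yaml.safe_load(f)
    return {col["name"]: col["type"] for col in contract["schema"]}

def classify_change(old_path: str, new_path: str) -> str:
    old, new = load_fields(old_path), load_fields(new_path)
    removed = old.keys() - new.keys()
    retyped = {c for c in old.keys() & new.keys() if old[c] != new[c]}
    added = new.keys() - old.keys()
    if removed or retyped:
        return "MAJOR"   # breaking: dropped or retyped columns
    if added:
        return "MINOR"   # additive: new columns only
    return "PATCH"       # docs/constraints only

# Example usage (hypothetical paths):
# print(classify_change("contracts/v1/carrier_performance.yaml",
#                       "contracts/carrier_performance.yaml"))
```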

2025-09-16 · 3 min · rokorolev

Privacy-Aware Pipelines: PII Detection, Redaction, and Governance at Scale

This post shares a practical architecture for privacy-aware data processing on Spark/Delta with PII discovery, redaction, and auditability. It reflects patterns used to replace external DLP APIs with in‑house pipelines.

1) Objectives

- Discover PII across semi/unstructured data (text/JSON).
- Redact or tokenize with policy-driven transformations.
- Preserve utility for analytics and ensure auditability.

2) Architecture Overview

```mermaid
flowchart TB
  A[Landing / Bronze] --> B{PII Scanner}
  B -->|classified| C[Redaction / Tokenization]
  C --> D[Delta Curated]
  B -->|no PII| D
  D --> E[Access Zones / Views]
  E --> F[Analytics / BI]
  D --> G[Audit Tables]
```

3) PII Detection (Presidio + Spark NLP)

- Use Presidio analyzers for entities (EMAIL, PHONE, CREDIT_CARD, PERSON, etc.).
- Complement with domain regex and Spark NLP for names/locations if needed.
- Confidence thresholds and context words to reduce false positives.

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Contact John at john.doe@example.com or +1-555-123-4567"
results = analyzer.analyze(text=text, language="en")

# Replace every detected entity with a fixed placeholder
redacted = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={"DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"})},
).text
print(redacted)
```

4) Spark Integration Pattern

- UDF wrapper calling Presidio per row (for small/medium texts); for large docs, batch per partition.
- Structured outputs: keep the original column, plus redacted_text and pii_entities (array of structs).
- Deterministic tokenization for referential integrity: hash with salt for IDs (see the sketch at the end of this post).

```python
from pyspark.sql import functions as F, types as T

def redact_text(text: str) -> str:
    # call analyzer/anonymizer; return redacted text
    return text  # placeholder

redact_udf = F.udf(redact_text, T.StringType())

df_out = (
    df_in
    .withColumn("redacted_text", redact_udf(F.col("text")))
)
```

5) Governance & Audit

- Write detections to pii_detection_log with columns: id, entity_type, start, end, confidence, doc_id, ds.
- Track policy decisions (mask/tokenize/pass) and version the policy set.
- Access via views: analysts see redacted columns by default; elevated roles can request reveal with approvals.

6) Quality & Monitoring

- Metrics: percent of docs with PII, entity counts by type, redaction coverage.
- Drift detection on entity distributions to catch model/pattern degradation.
- Sampling UI for manual review (Streamlit or a simple web app).

7) Operational Considerations

- Throughput: consider broadcasted dictionaries/context words; avoid heavy Python UDFs where possible (Scala UDFs or native libs are faster).
- Cost: cache pre-processing, process incrementally, skip re-redaction on unchanged docs (idempotence).
- Security: store salts/keys in secret scopes; lock down raw zones.

8) Minimal Policy Example

```yaml
version: 0.3.0
entities:
  - type: EMAIL
    action: redact
  - type: PHONE_NUMBER
    action: redact
  - type: PERSON
    action: tokenize
    token_salt_secret: keyvault://pii-tokens/person-salt
```

9) Rollout Plan

- Pilot on one high‑value dataset (support tickets or transcripts).
- Add governance hooks (policy version table, audit writes, views).
- Expand coverage by domain; tune thresholds; add sampling UI.

Pragmatic takeaway: privacy-aware pipelines protect users and your org while keeping data useful. Bake in policy, audit, and performance from day one.
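As a closing illustration of the deterministic tokenization mentioned in section 4, here is a minimal sketch that hashes a salted ID with Spark-native functions instead of a Python UDF, which also fits the throughput advice in section 7. The column names, sample rows, and the inline salt are assumptions for this example; in a real pipeline the salt would be loaded from a secret scope or key vault, never hard-coded.

```python
# Illustrative sketch: deterministic, salted tokenization with Spark-native
# functions (no Python UDF), so the same person_id always maps to the same token.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

salt = "load-me-from-a-secret-scope"  # placeholder, not a real secret

df_in = spark.createDataFrame(
    [("p-001", "Contact John at john.doe@example.com"),
     ("p-002", "Call +1-555-123-4567")],
    ["person_id", "text"],
)

df_tok = df_in.withColumn(
    "person_token",
    # sha2(salt || id) preserves referential integrity across tables
    # without exposing the raw identifier
    F.sha2(F.concat_ws("|", F.lit(salt), F.col("person_id")), 256),
).drop("person_id")

df_tok.show(truncate=False)
```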

2025-09-16 · 3 min · rokorolev

Architecture for a Probabilistic Risk Modeling Platform

This post outlines a platform architecture designed to model the impact of a hybrid risk registry (qualitative and quantitative risks) on an oil company’s key financial KPIs, such as EBITDA and Cash Flow, on a monthly basis. The design emphasizes modularity, auditability, and the integration of expert judgment with stochastic simulation.

1. Core Principles & Objectives

- Single Source of Truth: Establish a centralized, versioned Risk Registry for all identified risks.
- Hybrid Modeling: Natively support both quantitative risks (modeled with probability distributions) and qualitative risks (modeled with structured expert judgment).
- Financial Integration: Directly link risk events to a baseline financial plan (P&L, Cash Flow statement) to quantify impact.
- Probabilistic Output: Move beyond single-point estimates to deliver a distribution of potential outcomes (e.g., P10/P50/P90 EBITDA).
- Auditability & Reproducibility: Ensure every simulation run is traceable to a specific version of the risk registry, assumptions, and financial baseline.
- User-Centric Workflow: Provide intuitive interfaces for risk owners to provide input without needing to be simulation experts.

2. High-Level Architecture

The platform is designed as a set of modular services that interact through well-defined APIs and a shared data layer. ...
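To give a feel for the "Probabilistic Output" principle, here is a toy Monte Carlo sketch of how a small risk registry might be applied to a baseline monthly EBITDA to produce a P10/P50/P90 summary. Every figure, risk name, and distribution choice below is invented for illustration and is not part of the platform design.

```python
# Toy Monte Carlo sketch (all figures invented for illustration): apply a small
# risk registry to a baseline monthly EBITDA and summarize the outcome distribution.
import numpy as np

rng = np.random.default_rng(42)
n_sims = 10_000
baseline_ebitda = 120.0  # $M per month, illustrative

# Each risk: monthly probability of occurrence and a lognormal impact ($M)
risk_registry = [
    {"name": "unplanned downtime", "p": 0.10, "impact_mu": 1.5, "impact_sigma": 0.5},
    {"name": "price shock",        "p": 0.05, "impact_mu": 2.0, "impact_sigma": 0.7},
]

losses = np.zeros(n_sims)
for risk in risk_registry:
    occurred = rng.random(n_sims) < risk["p"]          # did the event happen?
    impact = rng.lognormal(risk["impact_mu"], risk["impact_sigma"], n_sims)
    losses += occurred * impact                        # add impact only when it occurred

ebitda = baseline_ebitda - losses
p10, p50, p90 = np.percentile(ebitda, [10, 50, 90])
print(f"P10={p10:.1f}  P50={p50:.1f}  P90={p90:.1f}  ($M, simulated)")
```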

2025-09-08 · 6 min · rokorolev

Petroleum Analytics Platform Architecture (2015–2018)

This post captures the practical architecture (2015–2018) that supported the upstream evaluation & stochastic modeling framework outlined in the related post: Upstream Asset Evaluation Framework. It maps legacy design choices to modern terminology and highlights constraints that shaped modeling workflows.

1. Core Principles

- On‑prem / hybrid HPC + Hadoop (YARN) cluster for heavy simulation; limited early cloud (select AWS EC2/EMR/S3; occasional Azure VM/Blob).
- No unified “lakehouse” yet: layered zones → Raw (HDFS/S3) → Curated (Hive/Parquet/ORC) → Marts (Hive/Impala/Presto).
- Limited containers/Kubernetes; batch schedulers dominated (Oozie, early Airflow pilot, Control‑M, Cron).
- Governance largely manual: Hive Metastore + ad hoc catalog (Excel / SharePoint / SQL).

2. Data Ingestion

| Source Type | Examples | Mechanism | Notes |
|---|---|---|---|
| Geoscience | LAS, SEG-Y | Batch file drop + ETL parse | Large binary + metadata extraction |
| Well / Ops | WITSML feeds | Batch pull / scheduled parse | Standardization step into Hive |
| ERP / Finance | CSV / RDBMS exports | Sqoop (RDBMS→HDFS), SSIS, Python/.NET ETL | Controlled nightly cadence |
| SCADA / Events | Downtime logs | Kafka 0.8/0.9 (where deployed) or Flume/Logstash | Early streaming footprint |
| Market / Pricing | Excel price decks | Staged in SQL then approved to config tables | Manual approval workflow |

Workflow orchestration: Oozie XML workflows early; selective Airflow DAGs (late 2017–2018) for transparency and dependency visualization. ...
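As a side note on the orchestration shift mentioned above, here is a minimal sketch of the kind of Airflow DAG (1.x-era API, BashOperator from airflow.operators.bash_operator) that replaced an Oozie XML workflow for a nightly RDBMS-to-HDFS load. The DAG id, JDBC string, table, and paths are invented for illustration and do not describe the actual pipelines.

```python
# Illustrative nightly ingest DAG (Airflow 1.x style): Sqoop import, then Hive load.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="nightly_erp_ingest",          # hypothetical name
    default_args=default_args,
    start_date=datetime(2018, 1, 1),
    schedule_interval="0 2 * * *",        # nightly cadence, as in the table above
    catchup=False,
) as dag:
    sqoop_import = BashOperator(
        task_id="sqoop_import_finance",
        bash_command=(
            "sqoop import --connect jdbc:oracle:thin:@erp-host:1521/ORCL "
            "--table FIN_LEDGER --target-dir /raw/erp/fin_ledger/{{ ds }} -m 4"
        ),
    )
    hive_load = BashOperator(
        task_id="hive_load_curated",
        bash_command="hive -f /etl/hql/load_fin_ledger_curated.hql",
    )

    # Explicit dependency replaces the implicit ordering of the old Oozie XML
    sqoop_import >> hive_load
```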

2025-09-04 · 5 min · rokorolev