Data Contract Evolution: From Ad Hoc Schemas to Governed Interfaces

This post outlines a pragmatic roadmap for evolving data contracts in an analytics platform. It favors incremental adoption over big-bang rewrites and ties directly to operational needs: stability, speed, and safety.

1) Why contracts (and why now)

- Control schema drift and reduce breakage.
- Enable fast feedback (pre-prod contract tests) and safer evolution.
- Improve ownership clarity and auditability.

2) Contract primitives

- Interface = table/view/topic + schema + semantics + SLA.
- Versioned schema in source control (YAML/JSON) with description and constraints.

Example: minimal YAML for a dataset contract.

```yaml
name: carrier_performance
version: 1.2.0
owner: analytics.eng@acme.example
schema:
  - name: carrier_id
    type: string
    constraints: [not_null]
  - name: ds
    type: date
    constraints: [not_null]
  - name: on_time_pct
    type: decimal(5,2)
    constraints: [range: {min: 0, max: 100}]
  - name: loads
    type: int
    constraints: [not_null, gte: 0]
slas:
  freshness: {max_lag_hours: 6}
  completeness: {min_rows: 1000}
```

3) Evolution rules

- SemVer: MAJOR for breaking, MINOR for additive, PATCH for docs/constraints.
- Deprecation windows and dual-publishing for breaking changes.
- Compatibility gate in CI (producer and consumer tests); a sketch of such a check appears at the end of this post.

```bash
# Pseudo-CI checks
contract check producer --path contracts/carrier_performance.yaml --against current-schema.json
contract check consumers --dataset carrier_performance --min-version 1.1.0
```

4) Validation patterns (Spark/Delta)

- Type/shape validation in ETL; catalog registration only on pass.
- Constraint checks materialized to a dq_issue table.

```python
# PySpark example: enforce contract types and simple constraints
from pyspark.sql import functions as F, types as T

def validate(df):
    # Expected schema per the contract (reference for the casts below).
    expected = T.StructType([
        T.StructField("carrier_id", T.StringType(), False),
        T.StructField("ds", T.DateType(), False),
        T.StructField("on_time_pct", T.DecimalType(5, 2), True),
        T.StructField("loads", T.IntegerType(), False),
    ])
    casted = df.select(
        F.col("carrier_id").cast("string").alias("carrier_id"),
        F.to_date("ds").alias("ds"),
        F.col("on_time_pct").cast("decimal(5,2)").alias("on_time_pct"),
        F.col("loads").cast("int").alias("loads"),
    )
    # Simple data-quality counters; materialize these to the dq_issue table.
    dq = [
        F.count(F.when(F.col("carrier_id").isNull(), 1)).alias("carrier_id_null"),
        F.count(F.when(F.col("ds").isNull(), 1)).alias("ds_null"),
    ]
    dq_df = casted.select(*dq)
    return casted, dq_df
```

5) Contract tests for consumers

- Consumers pin to a minimum compatible version and verify invariants.
- E.g., a downstream view expects on_time_pct to be present and within 0–100.

```sql
-- dbt / SQL test snippet
select count(*) as violations
from {{ ref('carrier_performance') }}
where on_time_pct is null
   or on_time_pct < 0
   or on_time_pct > 100
```

6) Governance & catalog

- Register contracts in a catalog (DataHub/OpenMetadata) with lineage and ownership.
- Emit events on contract version changes and SLA breaches.

7) Rollout plan

- Phase 1: Document schemas for 3 high-impact datasets; add read-only validation.
- Phase 2: Enforce contracts in CI for producers; add consumer tests.
- Phase 3: Introduce SemVer and deprecation workflows; add visibility via the catalog.
- Phase 4: Automate change review and impact analysis with lineage.

8) Reference diagram

```mermaid
flowchart TB
  A[Producer ETL] --> B[(Contract YAML)]
  B --> C{CI checks}
  C -->|pass| D[Publish to Curated]
  C -->|fail| E[Block & Report]
  D --> F[Consumers]
  F --> G[Contract Tests]
  D --> H[Catalog / Lineage]
```

Pragmatic takeaway: contracts lower entropy and raise change velocity. Start small, enforce where it matters, and expand coverage with clear ownership.
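
As a rough sketch of the CI compatibility gate from section 3, the snippet below diffs two versions of a contract file and maps the change onto the SemVer rules above. It assumes PyYAML is available; the classify_change function, the versioned file name, and the dict shape are illustrative assumptions, not an existing contract CLI or library.

```python
# Illustrative compatibility check between two contract versions.
# Assumes contracts are YAML files shaped like the example above; the
# versioned file name and classify_change() are hypothetical, not a real tool.
import yaml

def classify_change(old: dict, new: dict) -> str:
    """Return 'major', 'minor', or 'patch' per the SemVer rules above."""
    old_cols = {c["name"]: c for c in old["schema"]}
    new_cols = {c["name"]: c for c in new["schema"]}

    removed = old_cols.keys() - new_cols.keys()
    added = new_cols.keys() - old_cols.keys()
    retyped = [
        name
        for name in old_cols.keys() & new_cols.keys()
        if old_cols[name]["type"] != new_cols[name]["type"]
    ]

    if removed or retyped:
        return "major"  # breaking: columns dropped or retyped
    if added:
        return "minor"  # additive: existing consumers keep working
    return "patch"      # docs/constraints only

if __name__ == "__main__":
    with open("contracts/carrier_performance@1.1.0.yaml") as f_old, \
         open("contracts/carrier_performance.yaml") as f_new:
        print(classify_change(yaml.safe_load(f_old), yaml.safe_load(f_new)))
```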

2025-09-16 · 3 min · rokorolev

Privacy-Aware Pipelines: PII Detection, Redaction, and Governance at Scale

This post shares a practical architecture for privacy-aware data processing on Spark/Delta with PII discovery, redaction, and auditability. It reflects patterns used to replace external DLP APIs with in-house pipelines.

1) Objectives

- Discover PII across semi/unstructured data (text/JSON).
- Redact or tokenize with policy-driven transformations.
- Preserve utility for analytics and ensure auditability.

2) Architecture Overview

```mermaid
flowchart TB
  A[Landing / Bronze] --> B{PII Scanner}
  B -->|classified| C[Redaction / Tokenization]
  C --> D[Delta Curated]
  B -->|no PII| D
  D --> E[Access Zones / Views]
  E --> F[Analytics / BI]
  D --> G[Audit Tables]
```

3) PII Detection (Presidio + Spark NLP)

- Use Presidio analyzers for entities (EMAIL, PHONE, CREDIT_CARD, PERSON, etc.).
- Complement with domain regex and Spark NLP for names/locations if needed.
- Apply confidence thresholds and context words to reduce false positives.

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Contact John at john.doe@example.com or +1-555-123-4567"
results = analyzer.analyze(text=text, language="en")

# Replace every detected entity with a placeholder.
redacted = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={"DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"})},
).text
print(redacted)
```

4) Spark Integration Pattern

- UDF wrapper calling Presidio per row (for small/medium texts); for large docs, batch per partition.
- Structured outputs: keep the original column plus redacted_text and pii_entities (array of structs).
- Deterministic tokenization for referential integrity: hash with salt for IDs (see the sketch at the end of this post).

```python
from pyspark.sql import functions as F, types as T

def redact_text(text: str) -> str:
    # Call the Presidio analyzer/anonymizer here and return the redacted text.
    return text  # placeholder

redact_udf = F.udf(redact_text, T.StringType())

df_out = (
    df_in
    .withColumn("redacted_text", redact_udf(F.col("text")))
)
```

5) Governance & Audit

- Write detections to a pii_detection_log table with columns: id, entity_type, start, end, confidence, doc_id, ds.
- Track policy decisions (mask/tokenize/pass) and version the policy set.
- Access via views: analysts see redacted columns by default; elevated roles can request reveal with approvals.

6) Quality & Monitoring

- Metrics: percent of docs with PII, entity counts by type, redaction coverage.
- Drift detection on entity distributions to catch model/pattern degradation.
- Sampling UI for manual review (Streamlit or a simple web app).

7) Operational Considerations

- Throughput: consider broadcast dictionaries/context words; avoid heavy Python UDFs where possible (Scala UDFs or native libs are faster).
- Cost: cache pre-processing, process incrementally, and skip re-redaction of unchanged docs (idempotence).
- Security: store salts/keys in secret scopes; lock down raw zones.

8) Minimal Policy Example

```yaml
version: 0.3.0
entities:
  - type: EMAIL
    action: redact
  - type: PHONE_NUMBER
    action: redact
  - type: PERSON
    action: tokenize
    token_salt_secret: keyvault://pii-tokens/person-salt
```

9) Rollout Plan

- Pilot on one high-value dataset (support tickets or transcripts).
- Add governance hooks (policy version table, audit writes, views).
- Expand coverage by domain; tune thresholds; add the sampling UI.

Pragmatic takeaway: privacy-aware pipelines protect users and your org while keeping data useful. Bake in policy, audit, and performance from day one.
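
To make the deterministic-tokenization bullet in section 4 concrete, here is a minimal sketch using a salted SHA-256 hash so the same identifier always maps to the same token and downstream joins still work. The salt variable, secret scope, and column names are illustrative assumptions, not the production implementation.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative only: in practice load the salt from a secret scope
# (e.g. a key vault or Databricks secrets) rather than hard-coding it.
person_salt = "replace-with-secret-salt"

df_in = spark.createDataFrame(
    [("p-001", "Contact John at john.doe@example.com")],
    ["person_id", "text"],
)

# Deterministic token: same salt + same ID -> same token, so referential
# integrity is preserved without exposing the raw identifier.
df_tokenized = df_in.withColumn(
    "person_token",
    F.sha2(F.concat_ws("|", F.lit(person_salt), F.col("person_id")), 256),
).drop("person_id")
```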

2025-09-16 · 3 min · rokorolev

CarrierPerformanceReportsEtl

CarrierPerformanceReportsEtl is a production Spark/Scala data platform I architected and grew over ~4 years at WTG to ingest, evolve, and serve carrier & logistics performance analytics. I founded it as a solo engineer, then mentored rotating contributors while owning the roadmap, standards, and release quality (acting as de facto team & tech lead while titled Data Scientist / Senior Data Scientist).

1. Problem & Context

Logistics operations required timely, reliable KPIs (turnaround, message latency, carrier performance) sourced from heterogeneous semi-structured message streams and relational systems. Challenges: ...

2023-11-01 · 10 min · rokorolev