# Privacy-Aware Pipelines: PII Detection, Redaction, and Governance at Scale
This post shares a practical architecture for privacy-aware data processing on Spark/Delta, covering PII discovery, redaction, and auditability. It reflects patterns used to replace external DLP APIs with in-house pipelines.

## 1) Objectives

- Discover PII across semi-structured and unstructured data (text, JSON).
- Redact or tokenize it with policy-driven transformations.
- Preserve utility for analytics and ensure auditability.

## 2) Architecture Overview

```mermaid
flowchart TB
    A[Landing / Bronze] --> B{PII Scanner}
    B -->|classified| C[Redaction / Tokenization]
    C --> D[Delta Curated]
    B -->|no PII| D
    D --> E[Access Zones / Views]
    E --> F[Analytics / BI]
    D --> G[Audit Tables]
```

## 3) PII Detection (Presidio + Spark NLP)

- Use Presidio's built-in analyzers for common entities (EMAIL_ADDRESS, PHONE_NUMBER, CREDIT_CARD, PERSON, etc.).
- Complement them with domain-specific regex recognizers, and with Spark NLP for names and locations where needed.
- Apply confidence thresholds and context words to reduce false positives.

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Contact John at john.doe@example.com or +1-555-123-4567"
results = analyzer.analyze(text=text, language="en")

# Replace every detected entity with a fixed placeholder.
redacted = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={"DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"})},
).text
print(redacted)
```

## 4) Spark Integration Pattern

- A UDF wrapper calls Presidio per row for small and medium texts; for large documents, batch per partition (see the per-partition sketch in the appendix).
- Structured outputs: keep the original column, plus `redacted_text` and `pii_entities` (an array of structs).
- Deterministic tokenization for referential integrity: hash IDs with a secret salt so the same value always yields the same token (sketch in the appendix).

```python
from pyspark.sql import functions as F, types as T
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

def redact_text(text: str) -> str:
    if not text:
        return text
    # Engines are built where the UDF runs (they are not picklable);
    # cache one instance per worker in production, or batch per
    # partition as shown in the appendix.
    analyzer = AnalyzerEngine()
    anonymizer = AnonymizerEngine()
    results = analyzer.analyze(text=text, language="en")
    ops = {"DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"})}
    return anonymizer.anonymize(text=text, analyzer_results=results, operators=ops).text

redact_udf = F.udf(redact_text, T.StringType())
df_out = df_in.withColumn("redacted_text", redact_udf(F.col("text")))
```

## 5) Governance & Audit

- Write detections to a `pii_detection_log` table with columns `id`, `entity_type`, `start`, `end`, `confidence`, `doc_id`, `ds` (sketch in the appendix).
- Track policy decisions (mask/tokenize/pass) and version the policy set.
- Access via views: analysts see redacted columns by default; elevated roles can request reveal with approvals (view sketch in the appendix).

## 6) Quality & Monitoring

- Metrics: percent of documents containing PII, entity counts by type, redaction coverage (sketch in the appendix).
- Drift detection on entity distributions to catch model or pattern degradation.
- A sampling UI for manual review (Streamlit or a simple web app).

## 7) Operational Considerations

- Throughput: consider broadcasting dictionaries and context words; avoid heavy Python UDFs where possible (Scala UDFs or native libraries are faster).
- Cost: cache pre-processing, process incrementally, and skip re-redaction of unchanged documents (idempotence).
- Security: store salts and keys in secret scopes; lock down raw zones.

## 8) Minimal Policy Example

```yaml
version: 0.3.0
entities:
  - type: EMAIL_ADDRESS
    action: redact
  - type: PHONE_NUMBER
    action: redact
  - type: PERSON
    action: tokenize
    token_salt_secret: keyvault://pii-tokens/person-salt
```

A loader that turns this policy into anonymizer operators is sketched in the appendix.

## 9) Rollout Plan

- Pilot on one high-value dataset (support tickets or transcripts).
- Add governance hooks (policy version table, audit writes, views).
- Expand coverage by domain, tune thresholds, and add the sampling UI.

Pragmatic takeaway: privacy-aware pipelines protect users and your org while keeping data useful. Bake in policy, audit, and performance from day one.
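## Appendix: Sketches

The sketches referenced above follow. Each is a minimal illustration under stated assumptions, not a drop-in implementation; table, column, and secret names are placeholders unless the post defined them, and a live `spark` session is assumed throughout.

First, the per-partition pattern from section 4. Building Presidio's engines once per partition amortizes model loading across many rows; the sketch assumes an input DataFrame `df_in` with `doc_id` and `text` string columns.

```python
from pyspark.sql import types as T
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

out_schema = T.StructType([
    T.StructField("doc_id", T.StringType()),
    T.StructField("redacted_text", T.StringType()),
])

def redact_partition(rows):
    # One engine pair per partition instead of per row.
    analyzer = AnalyzerEngine()
    anonymizer = AnonymizerEngine()
    ops = {"DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"})}
    for row in rows:
        results = analyzer.analyze(text=row.text, language="en")
        redacted = anonymizer.anonymize(
            text=row.text, analyzer_results=results, operators=ops
        ).text
        yield (row.doc_id, redacted)

df_redacted = spark.createDataFrame(
    df_in.select("doc_id", "text").rdd.mapPartitions(redact_partition),
    out_schema,
)
```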
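Next, deterministic tokenization from section 4: a salted SHA-256 hash maps the same input to the same token everywhere, so joins on tokenized IDs still line up. The `dbutils.secrets` call is the Databricks secret-scope API; substitute your own secret store, and treat the column names as illustrative.

```python
import hashlib
from pyspark.sql import functions as F

# Fetch the salt from a secret scope (Databricks API shown; adapt as needed).
salt = dbutils.secrets.get(scope="pii-tokens", key="person-salt")

@F.udf("string")
def tokenize(value: str) -> str:
    if value is None:
        return None
    # Same value + same salt => same token, preserving referential integrity.
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

df_tokens = df_in.withColumn("person_token", tokenize(F.col("person_name")))
```

If Python UDF overhead matters, Spark's built-ins do the same thing natively: `F.sha2(F.concat(F.lit(salt), F.col("person_name")), 256)`.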
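For the audit trail in section 5, each detection becomes one row in `pii_detection_log`. The sketch assumes you already have Presidio analyzer results for a document together with its `doc_id` and partition date `ds`; the columns are the ones listed above, and the row-collection step is elided.

```python
from pyspark.sql import Row

def detection_rows(doc_id: str, ds: str, results):
    # One audit row per detected entity; `results` are Presidio
    # RecognizerResult objects returned by analyzer.analyze(...).
    for i, r in enumerate(results):
        yield Row(
            id=f"{doc_id}-{i}",
            entity_type=r.entity_type,
            start=r.start,
            end=r.end,
            confidence=float(r.score),
            doc_id=doc_id,
            ds=ds,
        )

# audit_rows: the Row objects collected during a scan run.
audit_df = spark.createDataFrame(audit_rows)
(audit_df.write.format("delta")
    .mode("append")
    .partitionBy("ds")
    .saveAsTable("pii_detection_log"))
```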
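The redacted-by-default access pattern from section 5 is just a view over the curated table: analysts query the view, while the raw column stays behind table ACLs. The table and view names here are illustrative.

```python
spark.sql("""
    CREATE OR REPLACE VIEW curated.support_tickets_redacted AS
    SELECT doc_id,
           ds,
           redacted_text AS text  -- analysts see only the redacted column
    FROM curated.support_tickets
""")
```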
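The section 6 metrics fall out of the detection log with a couple of aggregations. The sketch assumes `pii_detection_log` as defined above and a curated table with one row per document (the name is illustrative).

```python
from pyspark.sql import functions as F

log = spark.table("pii_detection_log")
docs = spark.table("curated.support_tickets")  # one row per document

# Entity counts by type per day -- the input series for drift detection.
entity_counts = log.groupBy("ds", "entity_type").agg(F.count("*").alias("n"))

# Percent of documents with at least one detection, per day.
flagged = log.select("doc_id").distinct().withColumn("has_pii", F.lit(1))
pct_with_pii = (
    docs.join(flagged, "doc_id", "left")
        .groupBy("ds")
        .agg((F.avg(F.coalesce(F.col("has_pii"), F.lit(0))) * 100)
             .alias("pct_docs_with_pii"))
)
```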
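Finally, a sketch of turning the section 8 policy into Presidio operators. `resolve_secret` is a hypothetical helper for the `keyvault://` reference, and the tokenize branch uses Presidio's custom-lambda operator; treat the whole mapping as an assumption about how a policy engine might be wired, not the post's actual implementation.

```python
import hashlib
import yaml
from presidio_anonymizer.entities import OperatorConfig

def resolve_secret(uri: str) -> str:
    # Hypothetical: fetch the secret behind keyvault://... from your store.
    raise NotImplementedError(uri)

def load_operators(policy_path: str) -> dict:
    # Map each policy entity to a Presidio anonymizer operator.
    with open(policy_path) as f:
        policy = yaml.safe_load(f)
    operators = {}
    for entity in policy["entities"]:
        if entity["action"] == "redact":
            operators[entity["type"]] = OperatorConfig(
                "replace", {"new_value": "<REDACTED>"})
        elif entity["action"] == "tokenize":
            salt = resolve_secret(entity["token_salt_secret"])
            operators[entity["type"]] = OperatorConfig(
                "custom",
                {"lambda": lambda v, s=salt: hashlib.sha256(
                    (s + v).encode("utf-8")).hexdigest()},
            )
    return operators
```

The returned dict plugs straight into `anonymizer.anonymize(..., operators=load_operators("policy.yaml"))`, as in the section 3 example, so the policy file, not the code, decides what happens to each entity type.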