This post shares a practical architecture for privacy-aware data processing on Spark/Delta, covering PII discovery, redaction, and auditability. It reflects patterns used to replace external DLP APIs with in-house pipelines.


1) Objectives

  • Discover PII across semi-structured and unstructured data (text, JSON).
  • Redact or tokenize with policy-driven transformations.
  • Preserve utility for analytics and ensure auditability.

2) Architecture Overview

flowchart TB
  A[Landing / Bronze] --> B{PII Scanner}
  B -->|classified| C[Redaction / Tokenization]
  C --> D[Delta Curated]
  B -->|no PII| D
  D --> E[Access Zones / Views]
  E --> F[Analytics / BI]
  D --> G[Audit Tables]

3) PII Detection (Presidio + Spark NLP)

  • Use Presidio analyzers for entities (EMAIL_ADDRESS, PHONE_NUMBER, CREDIT_CARD, PERSON, etc.).
  • Complement with domain regex and Spark NLP for names/locations if needed.
  • Apply confidence thresholds and context words to reduce false positives (see the threshold snippet after the example below).
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Contact John at john.doe@example.com or +1-555-123-4567"
results = analyzer.analyze(text=text, language="en")

# Replace every detected entity with a fixed placeholder.
redacted = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={"DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"})},
).text
print(redacted)
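
Per the thresholds bullet above, analyze accepts a minimum score directly; a minimal tweak (the 0.6 cutoff is an illustrative starting point, not a recommendation):

# Keep only detections at or above the cutoff; tune per entity type in practice.
results = analyzer.analyze(text=text, language="en", score_threshold=0.6)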

4) Spark Integration Pattern

  • UDF wrapper calling Presidio per row (for small/medium texts); for large docs, batch per partition.
  • Structured outputs: keep original column, plus redacted_text and pii_entities (array of structs).
  • Deterministic tokenization for referential integrity: salted hash for IDs (sketched after the snippet below).
from pyspark.sql import functions as F, types as T

_analyzer = None
_anonymizer = None

def redact_text(text: str) -> str:
    # Lazily create Presidio engines once per executor (per-row construction is too slow).
    global _analyzer, _anonymizer
    if _analyzer is None:
        from presidio_analyzer import AnalyzerEngine
        from presidio_anonymizer import AnonymizerEngine
        _analyzer, _anonymizer = AnalyzerEngine(), AnonymizerEngine()
    if text is None:
        return None
    results = _analyzer.analyze(text=text, language="en")
    # Default operator replaces each entity with its type, e.g. <EMAIL_ADDRESS>.
    return _anonymizer.anonymize(text=text, analyzer_results=results).text

redact_udf = F.udf(redact_text, T.StringType())

df_out = (
    df_in
    .withColumn("redacted_text", redact_udf(F.col("text")))
)
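
For the deterministic-tokenization bullet, a minimal sketch using Spark's built-in sha2; pii_salt and the person_name column are illustrative placeholders, and the real salt belongs in a secret scope (section 7):

from pyspark.sql import functions as F

# Hypothetical salt; fetch from a secret store in production, never hard-code.
pii_salt = "replace-with-secret-scope-value"

# Same value + same salt -> same token, so joins across tables still line up.
df_tokens = df_in.withColumn(
    "person_token",
    F.sha2(F.concat(F.lit(pii_salt), F.col("person_name")), 256),
)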

5) Governance & Audit

  • Write detections to pii_detection_log with columns: id, entity_type, start, end, confidence, doc_id, ds (append sketch after this list).
  • Track policy decisions (mask/tokenize/pass) and version the policy set.
  • Access via views: analysts see redacted columns by default; elevated roles can request reveal with approvals.
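
A hedged write sketch, assuming df_out carries doc_id, ds, and the pii_entities array of structs from section 4; table and column names follow the bullet above:

from pyspark.sql import functions as F

audit = (
    df_out
    .select("doc_id", "ds", F.explode("pii_entities").alias("e"))
    .select(
        F.expr("uuid()").alias("id"),  # surrogate row id
        F.col("e.entity_type").alias("entity_type"),
        F.col("e.start").alias("start"),
        F.col("e.end").alias("end"),
        F.col("e.confidence").alias("confidence"),
        "doc_id",
        "ds",
    )
)
audit.write.format("delta").mode("append").saveAsTable("pii_detection_log")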

6) Quality & Monitoring

  • Metrics: percent of docs with PII, entity counts by type, redaction coverage (sketched after this list).
  • Drift detection on entity distributions to catch model/pattern degradation.
  • Sampling UI for manual review (Streamlit or simple web).
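
Two of these metrics as hedged sketches, again assuming the pii_entities column from section 4:

from pyspark.sql import functions as F

# Percent of documents with at least one detected entity.
pct_with_pii = df_out.agg(
    (100 * F.avg((F.size("pii_entities") > 0).cast("double"))).alias("pct_docs_with_pii")
)

# Entity counts by type and partition date; a natural input for drift checks.
entity_counts = (
    df_out
    .select("ds", F.explode("pii_entities").alias("e"))
    .groupBy("ds", F.col("e.entity_type").alias("entity_type"))
    .count()
)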

7) Operational Considerations

  • Throughput: consider broadcast dictionaries/context words; avoid heavy Python UDFs where possible (Scala UDFs or native libraries are faster).
  • Cost: cache pre-processing, process incrementally, and skip re-redaction on unchanged docs for idempotence (see the sketch after this list).
  • Security: store salts/keys in secret scopes; lock down raw zones.
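
One way to get the idempotent skip: fingerprint the text and anti-join against a bookkeeping table of hashes already processed (redaction_state is a hypothetical name):

from pyspark.sql import functions as F

# Fingerprint raw text so unchanged docs are recognizable across runs.
df_fp = df_in.withColumn("content_hash", F.sha2(F.col("text"), 256))

# Hypothetical bookkeeping table of content hashes already redacted.
done = spark.read.table("redaction_state").select("content_hash")

# Only new or changed documents flow on to the scanner.
df_todo = df_fp.join(done, on="content_hash", how="left_anti")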

8) Minimal Policy Example

version: 0.3.0
entities:
  - type: EMAIL_ADDRESS
    action: redact
  - type: PHONE_NUMBER
    action: redact
  - type: PERSON
    action: tokenize
    token_salt_secret: keyvault://pii-tokens/person-salt
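
A hedged sketch of turning this policy into Presidio operators; the built-in hash operator stands in for salted tokenization, since resolving token_salt_secret would need a secret-store lookup and a custom operator:

import yaml
from presidio_anonymizer.entities import OperatorConfig

with open("pii_policy.yaml") as f:  # hypothetical policy file path
    policy = yaml.safe_load(f)

operators = {}
for rule in policy["entities"]:
    if rule["action"] == "redact":
        operators[rule["type"]] = OperatorConfig("replace", {"new_value": "<REDACTED>"})
    elif rule["action"] == "tokenize":
        # Stand-in: unsalted sha256 hash; production would salt per token_salt_secret.
        operators[rule["type"]] = OperatorConfig("hash", {"hash_type": "sha256"})

# operators then feeds anonymizer.anonymize(..., operators=operators).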

9) Rollout Plan

  • Pilot on one high‑value dataset (support tickets or transcripts).
  • Add governance hooks (policy version table, audit writes, views).
  • Expand coverage by domain; tune thresholds; add sampling UI.

Pragmatic takeaway: privacy-aware pipelines protect users and your org while keeping data useful. Bake in policy, audit, and performance from day one.