This post shares a practical architecture for privacy-aware data processing on Spark/Delta with PII discovery, redaction, and auditability. It reflects patterns used to replace external DLP APIs with in‑house pipelines.
1) Objectives
- Discover PII across semi/unstructured data (text/JSON).
- Redact or tokenize with policy-driven transformations.
- Preserve utility for analytics and ensure auditability.
2) Architecture Overview
```mermaid
flowchart TB
  A[Landing / Bronze] --> B{PII Scanner}
  B -->|classified| C[Redaction / Tokenization]
  C --> D[Delta Curated]
  B -->|no PII| D
  D --> E[Access Zones / Views]
  E --> F[Analytics / BI]
  D --> G[Audit Tables]
```
3) PII Detection (Presidio + Spark NLP)
- Use Presidio analyzers for entities (EMAIL, PHONE, CREDIT_CARD, PERSON, etc.).
- Complement with domain regex and Spark NLP for names/locations if needed.
- Apply confidence thresholds and context words to reduce false positives.
```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Contact John at john.doe@example.com or +1-555-123-4567"
results = analyzer.analyze(text=text, language="en")
results = [r for r in results if r.score >= 0.5]  # confidence threshold

# Replace every detected entity with a fixed placeholder
redacted = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={"DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"})},
).text
print(redacted)
```
4) Spark Integration Pattern
- UDF wrapper calling Presidio per row (for small/medium texts); for large docs, batch per partition.
- Structured outputs: keep the original column, plus `redacted_text` and `pii_entities` (an array of structs); see the batched sketch below.
- Deterministic tokenization for referential integrity: hash IDs with a salt (sketch under Operational Considerations).
```python
from pyspark.sql import functions as F, types as T

def redact_text(text: str) -> str:
    # In production, call the Presidio analyzer/anonymizer here;
    # initialize the engines once per worker, not per row.
    return text  # placeholder: pass-through until wired to Presidio

redact_udf = F.udf(redact_text, T.StringType())

df_out = df_in.withColumn("redacted_text", redact_udf(F.col("text")))
```
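For larger documents, the per-partition variant avoids re-initializing the engines on every row. Here is a minimal sketch using `mapInPandas`, assuming an input DataFrame `df_in` with `doc_id` and `text` columns (both names illustrative):

```python
from pyspark.sql import types as T

# Output schema: redacted text plus detected entities as an array of structs.
out_schema = T.StructType([
    T.StructField("doc_id", T.StringType()),
    T.StructField("redacted_text", T.StringType()),
    T.StructField("pii_entities", T.ArrayType(T.StructType([
        T.StructField("entity_type", T.StringType()),
        T.StructField("start", T.IntegerType()),
        T.StructField("end", T.IntegerType()),
        T.StructField("confidence", T.DoubleType()),
    ]))),
])

def redact_partition(batches):
    # Heavy objects are constructed once per partition, not per row.
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine
    from presidio_anonymizer.entities import OperatorConfig
    import pandas as pd

    analyzer = AnalyzerEngine()
    anonymizer = AnonymizerEngine()
    ops = {"DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"})}

    for pdf in batches:
        rows = []
        for doc_id, text in zip(pdf["doc_id"], pdf["text"]):
            results = analyzer.analyze(text=text, language="en")
            redacted = anonymizer.anonymize(
                text=text, analyzer_results=results, operators=ops
            ).text
            rows.append({
                "doc_id": doc_id,
                "redacted_text": redacted,
                "pii_entities": [
                    {"entity_type": r.entity_type, "start": r.start,
                     "end": r.end, "confidence": r.score}
                    for r in results
                ],
            })
        yield pd.DataFrame(rows)

df_out = df_in.mapInPandas(redact_partition, schema=out_schema)
```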
5) Governance & Audit
- Write detections to `pii_detection_log` with columns `id`, `entity_type`, `start`, `end`, `confidence`, `doc_id`, `ds` (sketch after this list).
- Track policy decisions (mask/tokenize/pass) and version the policy set.
- Access via views: analysts see redacted columns by default; elevated roles can request reveal with approvals.
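A minimal sketch of the audit write and the default-redacted view, assuming the `df_out` shape from the batched sketch above; table and view names (`audit.pii_detection_log`, `curated.tickets`) are illustrative:

```python
from pyspark.sql import functions as F

# Flatten detections into one row per entity and append to the audit table.
detections = (
    df_out
    .select("doc_id", F.explode("pii_entities").alias("e"))
    .select(
        F.expr("uuid()").alias("id"),
        F.col("e.entity_type").alias("entity_type"),
        F.col("e.start").alias("start"),
        F.col("e.end").alias("end"),
        F.col("e.confidence").alias("confidence"),
        F.col("doc_id"),
        F.current_date().alias("ds"),
    )
)
detections.write.format("delta").mode("append").saveAsTable("audit.pii_detection_log")

# Analysts get the redacted column by default; raw text stays locked down.
spark.sql("""
    CREATE OR REPLACE VIEW curated.tickets_redacted AS
    SELECT doc_id, redacted_text, ds FROM curated.tickets
""")
```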
6) Quality & Monitoring
- Metrics: percent of docs with PII, entity counts by type, redaction coverage (see the sketch after this list).
- Drift detection on entity distributions to catch model/pattern degradation.
- Sampling UI for manual review (Streamlit or simple web).
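A minimal sketch of the first two metrics, reusing the illustrative table names from above:

```python
from pyspark.sql import functions as F

log = spark.table("audit.pii_detection_log")
docs = spark.table("curated.tickets")

# Share of documents with at least one detection.
pct_with_pii = log.select("doc_id").distinct().count() / docs.count()

# Entity counts by type, useful as input to drift dashboards.
entity_counts = log.groupBy("entity_type").count().orderBy(F.desc("count"))
```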
7) Operational Considerations
- Throughput: consider broadcasted dictionaries/context words; avoid heavy Python UDFs where possible (Scala UDFs or native libraries are faster).
- Cost: cache pre-processing, process incrementally, skip re-redaction on unchanged docs (idempotence).
- Security: store salts/keys in secret scopes; lock down raw zones.
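A minimal tokenization sketch on Databricks, reading the salt from a secret scope (`pii-tokens`/`person-salt`, matching the policy below); the `person_id` column is illustrative:

```python
from pyspark.sql import functions as F

# The salt lives in a secret scope, never in code or config files.
salt = dbutils.secrets.get(scope="pii-tokens", key="person-salt")

# Deterministic token: the same input + salt always yields the same token,
# so joins across tables keep working (referential integrity).
df_tok = df_in.withColumn(
    "person_token", F.sha2(F.concat(F.lit(salt), F.col("person_id")), 256)
)
```

Because the hash is deterministic, the same person tokenizes identically everywhere, preserving joins without exposing the raw ID.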
8) Minimal Policy Example
```yaml
version: 0.3.0
entities:
  - type: EMAIL
    action: redact
  - type: PHONE_NUMBER
    action: redact
  - type: PERSON
    action: tokenize
    token_salt_secret: keyvault://pii-tokens/person-salt
```
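To wire the policy into the pipeline, a minimal loader sketch (assuming PyYAML and a local `pii_policy.yaml`, a hypothetical filename):

```python
import yaml

# Map entity types to their configured actions at job startup.
with open("pii_policy.yaml") as f:
    policy = yaml.safe_load(f)

actions = {e["type"]: e["action"] for e in policy["entities"]}
# actions == {"EMAIL": "redact", "PHONE_NUMBER": "redact", "PERSON": "tokenize"}
```

Recording `policy["version"]` alongside each audit write keeps every redaction traceable to the policy that produced it.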
9) Rollout Plan
- Pilot on one high‑value dataset (support tickets or transcripts).
- Add governance hooks (policy version table, audit writes, views).
- Expand coverage by domain; tune thresholds; add sampling UI.
Pragmatic takeaway: privacy-aware pipelines protect users and your org while keeping data useful. Bake in policy, audit, and performance from day one.