This post shares a practical architecture for privacy-aware data processing on Spark/Delta with PII discovery, redaction, and auditability. It reflects patterns used to replace external DLP APIs with in‑house pipelines.
1) Objectives
- Discover PII across semi/unstructured data (text/JSON).
- Redact or tokenize with policy-driven transformations.
- Preserve utility for analytics and ensure auditability.
2) Architecture Overview
```mermaid
flowchart TB
  A[Landing / Bronze] --> B{PII Scanner}
  B -->|classified| C[Redaction / Tokenization]
  C --> D[Delta Curated]
  B -->|no PII| D
  D --> E[Access Zones / Views]
  E --> F[Analytics / BI]
  D --> G[Audit Tables]
```
3) PII Detection (Presidio + Spark NLP)
- Use Presidio analyzers for entities (EMAIL, PHONE, CREDIT_CARD, PERSON, etc.).
- Complement with domain regex and Spark NLP for names/locations if needed.
- Apply confidence thresholds and context words to reduce false positives.
```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Contact John at john.doe@example.com or +1-555-123-4567"
results = analyzer.analyze(text=text, language="en")
results = [r for r in results if r.score >= 0.5]  # confidence threshold

# Replace every detected entity with a fixed placeholder
redacted = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={"DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"})},
).text
print(redacted)
```
4) Spark Integration Pattern
- UDF wrapper calling Presidio per row (for small/medium texts); for large docs, batch per partition.
- Structured outputs: keep the original column, plus `redacted_text` and `pii_entities` (an array of structs); see the batched sketch below.
- Deterministic tokenization for referential integrity: hash IDs with a salt (sketch under Operational Considerations).
```python
from pyspark.sql import functions as F, types as T

def redact_text(text: str) -> str:
    # In production, call the Presidio analyzer/anonymizer here;
    # initialize the engines once per worker, not per row.
    return text  # placeholder: pass-through until wired to Presidio

redact_udf = F.udf(redact_text, T.StringType())

df_out = df_in.withColumn("redacted_text", redact_udf(F.col("text")))
```
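For larger documents, the per-partition variant avoids re-initializing the engines on every row. Here is a minimal sketch using `mapInPandas`, assuming an input DataFrame `df_in` with `doc_id` and `text` columns (both names illustrative):

```python
from pyspark.sql import types as T

# Output schema: redacted text plus detected entities as an array of structs.
out_schema = T.StructType([
    T.StructField("doc_id", T.StringType()),
    T.StructField("redacted_text", T.StringType()),
    T.StructField("pii_entities", T.ArrayType(T.StructType([
        T.StructField("entity_type", T.StringType()),
        T.StructField("start", T.IntegerType()),
        T.StructField("end", T.IntegerType()),
        T.StructField("confidence", T.DoubleType()),
    ]))),
])

def redact_partition(batches):
    # Heavy objects are constructed once per partition, not per row.
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine
    from presidio_anonymizer.entities import OperatorConfig
    import pandas as pd

    analyzer = AnalyzerEngine()
    anonymizer = AnonymizerEngine()
    ops = {"DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"})}

    for pdf in batches:
        rows = []
        for doc_id, text in zip(pdf["doc_id"], pdf["text"]):
            results = analyzer.analyze(text=text, language="en")
            redacted = anonymizer.anonymize(
                text=text, analyzer_results=results, operators=ops
            ).text
            rows.append({
                "doc_id": doc_id,
                "redacted_text": redacted,
                "pii_entities": [
                    {"entity_type": r.entity_type, "start": r.start,
                     "end": r.end, "confidence": r.score}
                    for r in results
                ],
            })
        yield pd.DataFrame(rows)

df_out = df_in.mapInPandas(redact_partition, schema=out_schema)
```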
5) Governance & Audit
- Write detections to `pii_detection_log` with columns `id`, `entity_type`, `start`, `end`, `confidence`, `doc_id`, `ds` (sketch after this list).
- Track policy decisions (mask/tokenize/pass) and version the policy set.
- Access via views: analysts see redacted columns by default; elevated roles can request reveal with approvals.
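A minimal sketch of the audit write and the default-redacted view, assuming the `df_out` shape from the batched sketch above; table and view names (`audit.pii_detection_log`, `curated.tickets`) are illustrative:

```python
from pyspark.sql import functions as F

# Flatten detections into one row per entity and append to the audit table.
detections = (
    df_out
    .select("doc_id", F.explode("pii_entities").alias("e"))
    .select(
        F.expr("uuid()").alias("id"),
        F.col("e.entity_type").alias("entity_type"),
        F.col("e.start").alias("start"),
        F.col("e.end").alias("end"),
        F.col("e.confidence").alias("confidence"),
        F.col("doc_id"),
        F.current_date().alias("ds"),
    )
)
detections.write.format("delta").mode("append").saveAsTable("audit.pii_detection_log")

# Analysts get the redacted column by default; raw text stays locked down.
spark.sql("""
    CREATE OR REPLACE VIEW curated.tickets_redacted AS
    SELECT doc_id, redacted_text, ds FROM curated.tickets
""")
```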
6) Quality & Monitoring
- Metrics: percent of docs with PII, entity counts by type, redaction coverage (see the sketch after this list).
- Drift detection on entity distributions to catch model/pattern degradation.
- Sampling UI for manual review (Streamlit or simple web).
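A minimal sketch of the first two metrics, reusing the illustrative table names from above:

```python
from pyspark.sql import functions as F

log = spark.table("audit.pii_detection_log")
docs = spark.table("curated.tickets")

# Share of documents with at least one detection.
pct_with_pii = log.select("doc_id").distinct().count() / docs.count()

# Entity counts by type, useful as input to drift dashboards.
entity_counts = log.groupBy("entity_type").count().orderBy(F.desc("count"))
```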
7) Operational Considerations
- Throughput: consider broadcasted dictionaries/context words; avoid heavy Python UDFs where possible (Scala UDFs or native libraries are faster).
- Cost: cache pre-processing, process incrementally, skip re-redaction on unchanged docs (idempotence).
- Security: store salts/keys in secret scopes; lock down raw zones.
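A minimal tokenization sketch on Databricks, reading the salt from a secret scope (`pii-tokens`/`person-salt`, matching the policy below); the `person_id` column is illustrative:

```python
from pyspark.sql import functions as F

# The salt lives in a secret scope, never in code or config files.
salt = dbutils.secrets.get(scope="pii-tokens", key="person-salt")

# Deterministic token: the same input + salt always yields the same token,
# so joins across tables keep working (referential integrity).
df_tok = df_in.withColumn(
    "person_token", F.sha2(F.concat(F.lit(salt), F.col("person_id")), 256)
)
```

Because the hash is deterministic, the same person tokenizes identically everywhere, preserving joins without exposing the raw ID.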
8) Minimal Policy Example
```yaml
version: 0.3.0
entities:
  - type: EMAIL
    action: redact
  - type: PHONE_NUMBER
    action: redact
  - type: PERSON
    action: tokenize
    token_salt_secret: keyvault://pii-tokens/person-salt
```
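To wire the policy into the pipeline, a minimal loader sketch (assuming PyYAML and a local `pii_policy.yaml`, a hypothetical filename):

```python
import yaml

# Map entity types to their configured actions at job startup.
with open("pii_policy.yaml") as f:
    policy = yaml.safe_load(f)

actions = {e["type"]: e["action"] for e in policy["entities"]}
# actions == {"EMAIL": "redact", "PHONE_NUMBER": "redact", "PERSON": "tokenize"}
```

Recording `policy["version"]` alongside each audit write keeps every redaction traceable to the policy that produced it.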
9) Rollout Plan
- Pilot on one high‑value dataset (support tickets or transcripts).
- Add governance hooks (policy version table, audit writes, views).
- Expand coverage by domain; tune thresholds; add sampling UI.
Pragmatic takeaway: privacy-aware pipelines protect users and your org while keeping data useful. Bake in policy, audit, and performance from day one.