RedactifyAI is a Python package for detecting and anonymizing sensitive Personally Identifiable Information (PII) in textual data using Microsoft’s Presidio and Apache Spark.
You can install RedactifyAI from PyPI or by building the wheel file locally.
pip install redactify-ai
git clone https://github.com/your-repo/redactify-ai.git
cd redactify-ai
rm -rf build dist *.egg-info
python setup.py sdist bdist_wheel
pip install dist/redactify_ai-0.0.1-py3-none-any.whl
Create a config.yaml file with Presidio configuration (e.g., entities, anonymization rules).
presidio:
  entities:
    - PERSON
    - PHONE_NUMBER
    - EMAIL_ADDRESS
    - LOCATION
    - DATE_TIME
    - CREDIT_CARD
  language: en
  score_threshold: 0.6
  mask_character: "*"
  spacy_model: en_core_web_lg
  spacy_model_dir: /path/to/model/
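Before passing the configuration to the processor, it can help to sanity-check its values. The sketch below is not part of RedactifyAI: `validate_presidio_config` is a hypothetical helper, and it assumes the YAML has already been parsed into a plain dict with the keys shown above, that `score_threshold` is a confidence between 0 and 1, and that `mask_character` is a single character.

```python
def validate_presidio_config(config: dict) -> list:
    """Return a list of problems found; an empty list means the config looks sane.

    Hypothetical helper for illustration only, not a RedactifyAI API.
    """
    presidio = config.get("presidio")
    if not isinstance(presidio, dict):
        return ["missing top-level 'presidio' mapping"]

    problems = []

    # At least one entity type must be requested.
    entities = presidio.get("entities")
    if not isinstance(entities, list) or not entities:
        problems.append("'entities' must be a non-empty list")

    # The threshold is assumed to be a confidence score in [0, 1].
    threshold = presidio.get("score_threshold")
    is_number = isinstance(threshold, (int, float)) and not isinstance(threshold, bool)
    if not is_number or not 0.0 <= threshold <= 1.0:
        problems.append("'score_threshold' must be a number between 0 and 1")

    # The mask is assumed to be exactly one character.
    mask = presidio.get("mask_character")
    if not isinstance(mask, str) or len(mask) != 1:
        problems.append("'mask_character' must be a single character")

    return problems


# The dict below mirrors part of the sample config.yaml.
example = {
    "presidio": {
        "entities": ["PERSON", "EMAIL_ADDRESS"],
        "language": "en",
        "score_threshold": 0.6,
        "mask_character": "*",
    }
}
print(validate_presidio_config(example))  # prints []
```

Returning a list of problems rather than raising on the first one lets you report every issue in a config file at once.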
# Load the Presidio settings and build the processor.
from redactify_ai.config import load_config
from redactify_ai.processor import PresidioDLPProcessor
config = load_config("config.yaml")
processor = PresidioDLPProcessor(config)
# Apply the processor to a Spark DataFrame column through a UDF.
from redactify_ai.utils import anonymize_text_udf
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PresidioDLP").getOrCreate()
# Build a one-row sample DataFrame with a free-text column.
data = [("Hi, I'm John Doe. Email me at john.doe@gmail.com.",)]
df = spark.createDataFrame(data, ["transcripts"])
# Wrap the processor in a UDF and write the redacted text to a new column.
anonymize_udf = anonymize_text_udf(processor)
df_redacted = df.withColumn("transcripts_redacted", anonymize_udf(df["transcripts"]))
df_redacted.show(truncate=False)
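To make the roles of `mask_character` and `score_threshold` concrete, here is a standalone sketch of span masking on the same sample sentence. It does not call Presidio or RedactifyAI. The `Detection` tuples and their scores are hand-written stand-ins for analyzer results, `mask_spans` is a hypothetical helper, and the sketch assumes each detected span is overwritten character for character, which may differ from what the package actually does.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    """Hand-written stand-in for one analyzer result (not a Presidio class)."""
    entity_type: str
    start: int    # inclusive character offset
    end: int      # exclusive character offset
    score: float  # confidence, assumed to lie in [0, 1]


def mask_spans(text, detections, mask_character="*", score_threshold=0.6):
    """Overwrite every detection at or above the threshold with the mask character."""
    chars = list(text)
    for det in detections:
        if det.score < score_threshold:
            continue  # low-confidence detections are left untouched
        for i in range(det.start, min(det.end, len(chars))):
            chars[i] = mask_character
    return "".join(chars)


text = "Hi, I'm John Doe. Email me at john.doe@gmail.com."
name = "John Doe"
email = "john.doe@gmail.com"

# Offsets are computed from the text; the scores are made up for illustration.
detections = [
    Detection("PERSON", text.index(name), text.index(name) + len(name), 0.85),
    Detection("EMAIL_ADDRESS", text.index(email), text.index(email) + len(email), 1.0),
    Detection("LOCATION", 0, 2, 0.3),  # below 0.6, so "Hi" stays as-is
]

masked = mask_spans(text, detections)
print(masked)
```

Masking character for character keeps the redacted string the same length as the input, so downstream offsets still line up.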
To run the pipeline script provided in this repository:
python run_pipeline.py
An end-to-end test is provided using Docker. This verifies the full pipeline, including Spark and real NLP models.
Make sure test_config.yaml, test_pipeline_integration.py, and the project source code are in your working directory.
docker build -t redactify-test .
docker run --rm -w /app redactify-test bash -c "
python setup.py sdist bdist_wheel &&
pip install dist/redactify_ai-0.0.1-py3-none-any.whl &&
pytest tests/test_pipeline_integration.py --maxfail=1 --disable-warnings --tb=short
"
The test passes if the sensitive information is successfully redacted from the output. Printed results will show mask characters (*) in place of PII.
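The pass criterion can be expressed as a small check like the one below. This is a sketch of the idea, not the contents of tests/test_pipeline_integration.py: `assert_redacted` is a hypothetical helper, and the redacted string is a hand-written example rather than real pipeline output.

```python
def assert_redacted(original_pii, redacted_text, mask_character="*"):
    """Fail if any known PII string survives, or if no mask character is present."""
    leaked = [value for value in original_pii if value in redacted_text]
    assert not leaked, f"PII leaked into output: {leaked}"
    assert mask_character in redacted_text, "expected mask characters in output"


# Hand-written example of what a redacted transcript might look like.
assert_redacted(
    ["John Doe", "john.doe@gmail.com"],
    "Hi, I'm ********. Email me at ******************.",
)
```

Checking for the absence of the original values, rather than for an exact masked string, keeps the check stable if the number of mask characters changes.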
Contributions are welcome! Please create issues or pull requests if you find bugs or would like to add new features.
RedactifyAI is licensed under the MIT License. See the LICENSE file for details.