RedactifyAI

RedactifyAI is a Python package for detecting and anonymizing sensitive Personally Identifiable Information (PII) in textual data using Microsoft’s Presidio and Apache Spark.

Key Features

Integration with Presidio to detect and anonymize PII such as names, emails, phone numbers, and more.
Spark-powered processing for scalable anonymization using PySpark.
Custom recognizers to extend PII detection for specific needs.
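
As a rough illustration of what a custom recognizer does, the sketch below uses only Python's standard library rather than Presidio's own recognizer classes. The EMPLOYEE_ID entity, the EMP- pattern, and the find_employee_ids helper are hypothetical and are not part of RedactifyAI.

```python
import re
from typing import List, NamedTuple


class Detection(NamedTuple):
    entity_type: str
    start: int
    end: int
    score: float


# Hypothetical pattern: "EMP-" followed by exactly six digits.
EMPLOYEE_ID_PATTERN = re.compile(r"\bEMP-\d{6}\b")


def find_employee_ids(text: str, score: float = 0.85) -> List[Detection]:
    """Return one Detection per pattern match, each with the same fixed score."""
    return [
        Detection("EMPLOYEE_ID", m.start(), m.end(), score)
        for m in EMPLOYEE_ID_PATTERN.finditer(text)
    ]


detections = find_employee_ids("Badge EMP-123456 was issued to John.")
for d in detections:
    print(d.entity_type, d.start, d.end, d.score)
```

A real custom recognizer would be registered with Presidio so that its detections flow through the same anonymization step as the built-in entities.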

Models

en_core_web_lg (spaCy's large English model, selected through the spacy_model setting in the configuration)

Installation

You can install RedactifyAI from PyPI or by building the wheel file locally.

Install from PyPI

pip install redactify-ai

Build Locally

Clone the repository:

git clone https://github.com/your-repo/redactify-ai.git
cd redactify-ai

Build the wheel:

rm -rf build dist *.egg-info
python setup.py sdist bdist_wheel

Install the wheel:

pip install dist/redactify_ai-0.0.1-py3-none-any.whl

Usage

Configuration:
Prepare a config.yaml file with the Presidio settings: the entities to detect, the language, the score threshold, the mask character, and the spaCy model and its directory.

presidio:
   entities:
      - PERSON
      - PHONE_NUMBER
      - EMAIL_ADDRESS
      - LOCATION
      - DATE_TIME
      - CREDIT_CARD
   language: en
   score_threshold: 0.6
   mask_character: "*"
   spacy_model: en_core_web_lg
   spacy_model_dir: /path/to/model/
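
To make the score_threshold and mask_character settings concrete, here is a minimal standard-library sketch of one plausible way such settings could be applied: detections scoring below the threshold are ignored, and every remaining span is overwritten with the mask character. The mask_spans helper and the hand-written spans are hypothetical. The actual behavior is determined by Presidio and may differ.

```python
from typing import Iterable, Tuple

# (start, end, score) for each detected span; the values below are made up.
Span = Tuple[int, int, float]


def mask_spans(text: str, spans: Iterable[Span],
               score_threshold: float = 0.6, mask_character: str = "*") -> str:
    """Overwrite every span scoring at or above the threshold with the mask character."""
    chars = list(text)
    for start, end, score in spans:
        if score < score_threshold:
            continue  # low-confidence detection: leave the text untouched
        for i in range(start, min(end, len(chars))):
            chars[i] = mask_character
    return "".join(chars)


text = "Call John at 555-0100."
# "John" scores above the default threshold; the phone number does not.
spans = [(5, 9, 0.85), (13, 21, 0.4)]
print(mask_spans(text, spans))
```

Lowering score_threshold masks more text at the cost of more false positives; raising it does the opposite.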

Create a Processor:

from redactify_ai.config import load_config
from redactify_ai.processor import PresidioDLPProcessor

config = load_config("config.yaml")
processor = PresidioDLPProcessor(config)

Anonymize a DataFrame with PySpark:

from pyspark.sql import SparkSession
from redactify_ai.utils import anonymize_text_udf

# `processor` is the PresidioDLPProcessor created in the previous step.
spark = SparkSession.builder.appName("PresidioDLP").getOrCreate()
# One-column DataFrame holding the raw text to anonymize.
data = [("Hi, I'm John Doe. Email me at john.doe@gmail.com.",)]
df = spark.createDataFrame(data, ["transcripts"])
# Build the UDF from the processor and write its output to a new column.
anonymize_udf = anonymize_text_udf(processor)
df_redacted = df.withColumn("transcripts_redacted", anonymize_udf(df["transcripts"]))
df_redacted.show(truncate=False)

Running the Pipeline

To run the pipeline script provided in this repository:

python run_pipeline.py

End-to-End Integration Testing

An end-to-end test is provided using Docker. This verifies the full pipeline, including Spark and real NLP models.

Ensure Docker is installed.
Make sure test_config.yaml, test_pipeline_integration.py, and the project source code are in your working directory.

Build the Docker image:
```
docker build -t redactify-test .
```

Run the test suite:

docker run --rm -w /app redactify-test bash -c "
    python setup.py sdist bdist_wheel &&
    pip install dist/redactify_ai-0.0.1-py3-none-any.whl &&
    pytest tests/test_pipeline_integration.py --maxfail=1 --disable-warnings --tb=short
"

The test passes if the sensitive information is redacted from the output. The printed results show mask characters (*) in place of PII.
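
The pass criterion described above can be sketched as a small check. This is only an illustration under assumptions: the assert_redacted helper and the sample string are hypothetical and are not taken from the repository's test file.

```python
from typing import Iterable


def assert_redacted(redacted: str, pii_values: Iterable[str],
                    mask_character: str = "*") -> None:
    """Fail if any known PII value survives or if no mask character is present."""
    for value in pii_values:
        assert value not in redacted, f"PII leaked: {value!r}"
    assert mask_character in redacted, "expected mask characters in the output"


# Sample output written by hand for illustration, not produced by the pipeline.
redacted = "Hi, I'm ********. Email me at ******************."
assert_redacted(redacted, ["John Doe", "john.doe@gmail.com"])
```

Checking for the absence of known PII values is stricter than checking only for mask characters, since a partially redacted string would still contain asterisks.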

Contributing

Contributions are welcome! Please create issues or pull requests if you find bugs or would like to add new features.

License

RedactifyAI is licensed under the MIT License. See the LICENSE file for details.

For details, please contact the developer to request access to the private RedactifyAI repository.