RedactifyAI

RedactifyAI is a Python package for detecting and anonymizing sensitive Personally Identifiable Information (PII) in textual data using Microsoft’s Presidio and Apache Spark.

Key Features

Models


Installation

You can install RedactifyAI from PyPI or by building the wheel file locally.

Install from PyPI

pip install redactify-ai

Build Locally

  1. Clone the repository:
    git clone https://github.com/your-repo/redactify-ai.git
    cd redactify-ai
    
  2. Build the wheel:
    rm -rf build dist *.egg-info
    python setup.py sdist bdist_wheel
    
  3. Install the wheel:
    pip install dist/redactify_ai-0.0.1-py3-none-any.whl
    

Usage

  1. Configuration:
    Prepare a config.yaml file with Presidio configuration (e.g., entities, anonymization rules).
    presidio:
       entities:
          - PERSON
          - PHONE_NUMBER
          - EMAIL_ADDRESS
          - LOCATION
          - DATE_TIME
          - CREDIT_CARD
       language: en
       score_threshold: 0.6
       mask_character: "*"
       spacy_model: en_core_web_lg
       spacy_model_dir: /path/to/model/
            
  2. Create a Processor:
    from redactify_ai.config import load_config
    from redactify_ai.processor import PresidioDLPProcessor
    
    config = load_config("config.yaml")
    processor = PresidioDLPProcessor(config)
    
  3. Anonymize DataFrame with PySpark:
    from redactify_ai.utils import anonymize_text_udf
    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.appName("PresidioDLP").getOrCreate()
    data = [("Hi, I'm John Doe. Email me at john.doe@gmail.com.",)]
    df = spark.createDataFrame(data, ["transcripts"])
    anonymize_udf = anonymize_text_udf(processor)
    df_redacted = df.withColumn("transcripts_redacted", anonymize_udf(df["transcripts"]))
    df_redacted.show(truncate=False)
    

Running the Pipeline

To run the pipeline script provided in this repository:

python run_pipeline.py

End-to-End Integration Testing

An end-to-end test is provided using Docker. This verifies the full pipeline, including Spark and real NLP models.

  1. Build the Docker image:
    docker build -t redactify-test .
  2. Run the test suite:
    docker run --rm -w /app redactify-test bash -c "
        python setup.py sdist bdist_wheel &&
        pip install dist/redactify_ai-0.0.1-py3-none-any.whl &&
        pytest tests/test_pipeline_integration.py --maxfail=1 --disable-warnings --tb=short
    "

The test will pass if the sensitive information is successfully redacted from output. Printed results will show mask characters (*) in place of PII.

Contributing

Contributions are welcome! Please create issues or pull requests if you find bugs or would like to add new features.

License

RedactifyAI is licensed under the MIT License. See the LICENSE file for details.

Please, contact the developer to request access to the private repo for details:   RedactifyAI Repository