damorris25/custom-tag-processors

Custom Tag Processors

A scalable data security tagging processor that integrates multiple PII detection engines (Microsoft Presidio and OpenPipe) into a unified gRPC service and Web UI.

Features

  • Multi-Processor Support: Choose between Microsoft Presidio and OpenPipe PII Redaction.
  • File Support: Extract text from PDF, DOCX, XLSX, PPTX, ZIP, TXT, and Images (OCR).
  • Web Interface: Clean, modern UI for interactive testing and analysis.
  • Standardized API: gRPC interface using processor.proto.
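The file-support feature can be pictured as dispatching on the file extension before any PII analysis runs. The sketch below is illustrative only and is not taken from this repository: `extract_text` is a hypothetical helper, and it covers just TXT and ZIP because those need nothing beyond the Python standard library. PDF, DOCX, XLSX, PPTX, and image OCR would need third-party libraries and are left out.

```python
import io
import zipfile
from pathlib import PurePosixPath


def extract_text(filename: str, data: bytes) -> str:
    """Hypothetical helper: return the text of a TXT file or of a ZIP's members."""
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix == ".txt":
        # Replace undecodable bytes instead of failing on mixed encodings.
        return data.decode("utf-8", errors="replace")
    if suffix == ".zip":
        parts = []
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                try:
                    # Recurse so nested archives and text members are both handled.
                    parts.append(extract_text(info.filename, archive.read(info)))
                except ValueError:
                    # Skip members whose type this sketch does not support.
                    continue
        return "\n".join(parts)
    raise ValueError(f"unsupported file type: {suffix or filename}")
```

Unsupported members inside an archive are skipped rather than aborting the whole upload, which is one reasonable policy among several; a stricter service might report them instead.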

Prerequisites

  • Python: 3.11+
  • Poetry: Dependency management (see the official Poetry installation instructions)
  • Tesseract OCR: Required for image analysis.
    • Mac: brew install tesseract
    • Ubuntu: sudo apt-get install tesseract-ocr
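Before relying on image analysis, it may help to confirm that the Tesseract binary is actually on the PATH. The following is a minimal sketch using only the standard library; the function name `tesseract_available` is an assumption for illustration and is not part of this project.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the named executable can be found on PATH."""
    return shutil.which(binary) is not None


# Example: warn early instead of failing on the first image upload.
if not tesseract_available():
    print("tesseract not found on PATH; image OCR will not work")
```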

Installation

  1. Clone the repository.
  2. Install dependencies using Poetry:
    poetry install
    Note: This will also install the required spaCy model for Presidio.

Running the Application

You need to run two components: the gRPC Server (backend logic) and the Web Gateway (HTTP/UI).

1. Start the gRPC Server

This server handles the actual PII detection.

poetry run python src/server.py

The gRPC server listens on port 50051.

2. Start the Web Gateway

This serves the UI and proxies requests to the gRPC server.

poetry run python src/web_gateway.py

The web gateway listens on port 8000.
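Since both components must be running, a quick way to confirm they are up is to try a TCP connection to each port. This is a sketch under the assumption that both services run on localhost with the ports stated above; `is_listening` and `check_services` are hypothetical helper names, not part of the project.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_services(host: str = "localhost") -> dict:
    """Check the two ports named in this README."""
    ports = {"grpc_server": 50051, "web_gateway": 8000}
    return {name: is_listening(host, port) for name, port in ports.items()}
```

A successful connection only shows that something is accepting connections on the port; it does not verify that the process behind it is the expected service.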

Using the Web UI

  1. Open http://localhost:8000 in your browser.
  2. Select Content Source: Choose "Microsoft Presidio" or "OpenPipe PII Redaction".
  3. Input Data:
    • Paste Text: Enter text directly into the text area.
    • Upload File: Drag and drop or select a file (PDF, DOCX, Images, etc.).
  4. Click Analyze Content.
  5. View results in the panel below. You can also inspect the raw extracted text.

Running Tests

Run the unit test suite using pytest:

poetry run pytest

Project Structure

  • src/server.py: gRPC server implementation.
  • src/web_gateway.py: FastAPI gateway and static file server.
  • src/processors/: Processor implementations (Presidio, OpenPipe).
  • src/static/: Frontend assets (HTML, CSS, JS).
  • protos/: Protocol Buffer definitions.
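To illustrate how several processors can sit behind one entry point, here is a minimal registry sketch. Everything in it is hypothetical: the `Finding` type, the `register` decorator, the `analyze` function, and the toy regex-based email detector are stand-ins invented for this example and do not reflect the actual code in src/processors/ or the Presidio and OpenPipe APIs.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Finding:
    """One detected span (hypothetical result type)."""
    entity_type: str
    start: int
    end: int
    text: str


Processor = Callable[[str], List[Finding]]
PROCESSORS: Dict[str, Processor] = {}


def register(name: str) -> Callable[[Processor], Processor]:
    """Decorator that adds a processor function to the registry under `name`."""
    def decorator(func: Processor) -> Processor:
        PROCESSORS[name] = func
        return func
    return decorator


_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@register("toy-email")
def toy_email_processor(text: str) -> List[Finding]:
    """Toy detector standing in for a real engine; matches email-like strings."""
    return [
        Finding("EMAIL_ADDRESS", m.start(), m.end(), m.group())
        for m in _EMAIL.finditer(text)
    ]


def analyze(processor_name: str, text: str) -> List[Finding]:
    """Look up the chosen processor by name and run it on the text."""
    try:
        processor = PROCESSORS[processor_name]
    except KeyError:
        raise ValueError(f"unknown processor: {processor_name!r}") from None
    return processor(text)
```

With a layout like this, adding another engine would mean registering one more function, and the caller's choice of processor reduces to a string lookup.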

License

This project is licensed under the MIT License - see the LICENSE file for details. Copyright (c) 2025 Dana Morris
