A scalable data security tagging processor that integrates multiple PII detection engines (Microsoft Presidio and OpenPipe) into a unified gRPC service and Web UI.
- Multi-Processor Support: Choose between Microsoft Presidio and OpenPipe PII Redaction.
- File Support: Extract text from PDF, DOCX, XLSX, PPTX, ZIP, TXT, and Images (OCR).
- Web Interface: Clean, modern UI for interactive testing and analysis.
- Standardized API: gRPC interface using processor.proto.
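As a rough illustration of the file support listed above, uploaded files could be routed to a text extractor by extension. This is only a sketch: `EXTRACTOR_BY_SUFFIX` and `pick_extractor` are hypothetical names, the image suffixes are assumed (the list above only says "Images (OCR)"), and the project's actual extraction code may be organised differently.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to an extractor label.
# The labels are illustrative and do not correspond to real module names.
EXTRACTOR_BY_SUFFIX = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".xlsx": "xlsx",
    ".pptx": "pptx",
    ".zip": "zip",
    ".txt": "text",
    # Assumed image suffixes; images would go through Tesseract OCR.
    ".png": "ocr",
    ".jpg": "ocr",
    ".jpeg": "ocr",
}


def pick_extractor(filename: str) -> str:
    """Return the extractor label for a filename, or raise ValueError."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    try:
        return EXTRACTOR_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
```

Rejecting unknown suffixes early keeps unsupported uploads from reaching the detection step.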
- Python: 3.11+
- Poetry: Dependency management (Install Instructions)
- Tesseract OCR: Required for image analysis.
  - Mac: brew install tesseract
  - Ubuntu: sudo apt-get install tesseract-ocr
- Clone the repository.
- Install dependencies using Poetry:
  poetry install
  Note: This also installs the required spaCy model for Presidio.
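The prerequisites above (Python 3.11+, Poetry, Tesseract) can be checked with a short script before running anything. This is a minimal sketch using only the standard library; `check_prerequisites` is a hypothetical helper and not part of the repository, and it assumes the `poetry` and `tesseract` executables are on `PATH` under those names.

```python
import shutil
import sys


def check_prerequisites() -> dict[str, bool]:
    """Report whether each prerequisite appears to be available."""
    return {
        # The interpreter running this script must be 3.11 or newer.
        "python>=3.11": sys.version_info >= (3, 11),
        # shutil.which returns None when the executable is not on PATH.
        "poetry": shutil.which("poetry") is not None,
        "tesseract": shutil.which("tesseract") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```

The check only confirms that the executables can be found; it does not verify their versions or that Tesseract language data is installed.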
You need to run two components: the gRPC Server (backend logic) and the Web Gateway (HTTP/UI).
The gRPC server handles the actual PII detection:
poetry run python src/server.py
It listens on port 50051.
The Web Gateway serves the UI and proxies requests to the gRPC server:
poetry run python src/web_gateway.py
It listens on port 8000.
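Since both components must be running before the UI is usable, a quick way to confirm they are up is to test whether their ports accept TCP connections. The sketch below uses only the standard library; `wait_for_port` is a hypothetical helper, and `localhost` is assumed as the host for both services.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0) -> bool:
    """Poll until host:port accepts a TCP connection or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.2)  # not ready yet; retry shortly
    return False


if __name__ == "__main__":
    for name, port in (("gRPC server", 50051), ("web gateway", 8000)):
        state = "up" if wait_for_port("localhost", port, timeout=5.0) else "not reachable"
        print(f"{name} on port {port}: {state}")
```

A successful connection shows only that the port is open, not that the service behind it is healthy.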
- Open http://localhost:8000 in your browser.
- Select Content Source: Choose "Microsoft Presidio" or "OpenPipe PII Redaction".
- Input Data:
  - Paste Text: Enter text directly into the text area.
  - Upload File: Drag and drop or select a file (PDF, DOCX, Images, etc.).
- Click Analyze Content.
- View results in the panel below. You can also inspect the raw extracted text.
Run the unit test suite using pytest:
poetry run pytest
Project structure:
- src/server.py: gRPC server implementation.
- src/web_gateway.py: FastAPI gateway and static file server.
- src/processors/: Processor implementations (Presidio, OpenPipe).
- src/static/: Frontend assets (HTML, CSS, JS).
- protos/: Protocol Buffer definitions.
This project is licensed under the MIT License; see the LICENSE file for details.
Copyright (c) 2025 Dana Morris