Surrogator 🐊

Notes before Usage
Configuration and Run

Content

Notes before Usage

This is the Surrogator, a Python-based framework designed to enhance privacy in German-language clinical text documents by replacing pre-annotated and pre-processed sensitive information with privacy-preserving placeholders.

The Surrogator's processing is based on pre-annotated PII (Personally Identifiable Information); the framework was developed during the GeMTeX project. The input consists of PII annotations from the INCEpTION annotation platform in the UIMA Cas format. UIMA Cas is handled by DKPro-cassis and PyCaprio.

Currently, the pipeline is designed to generate placeholders / surrogates automatically for a specific set of categories (listed under Step 1). Any remaining types of sensitive information are addressed manually during a subsequent quality control step.

Used PII Annotation Scheme

The used annotation scheme of the PII is based on the GeMTeX de-identification type-system (annotation-layer):

NAME
- NAME_PATIENT
- NAME_RELATIVE
- NAME_DOCTOR
- NAME_EXT
- NAME_USERNAME
- NAME_TITLE
DATE
- DATE_BIRTH
- DATE_DEATH
- DATE
AGE
LOCATION
- LOCATION_STREET
- LOCATION_CITY
- LOCATION_ZIP
- LOCATION_COUNTRY
- LOCATION_STATE
- LOCATION_HOSPITAL
- LOCATION_ORGANIZATION
- LOCATION_OTHER
ID
CONTACT
- CONTACT_PHONE
- CONTACT_EMAIL
- CONTACT_FAX
- CONTACT_URL
PROFESSION
OTHER

In alignment with the Datenschutz-Konzept of the Medizininformatik-Initiative, there is a specific focus on the following types of sensitive information:

Names
Date of Birth
Date of Death
Address details
Identifiers (e.g., insurance numbers, patient IDs from the hospital information system)

De-Identification Workflow within the Surrogator

Pre-Tagging: Clinical documents are first pre-annotated by an automated system marking potential PII.
Manual Annotation and Curation: A 2+1 review process is applied within the INCEpTION annotation platform. Two annotators independently review and correct the pre-annotations. Subsequently, a curator harmonizes the two versions, resolves discrepancies, and ensures the final, quality-assured annotation.
Export: The completed annotation projects are exported from INCEpTION in the UIMA CAS JSON format. This format contains both the original text and the associated PII annotations with their types (e.g., NAME_PATIENT, DATE_BIRTH) and exact positions in the text.

Workflow of Surrogator

Step 0: Input and Data Preparation

The annotations from the de-identification process, along with their corresponding curations, are required.
Export the annotations of the full annotation project (the export handling is adapted from the inception-reporting-dashboard; do not use the Curation Export Mode) and ensure that the format is set to UIMA Cas JSON.
The command-line and experimental modes allow individual UIMA Cas JSON files as input.
The example directory test_data contains:
- edge case examples as plain text: test_data/deid-test-doc
- 2 test projects
  - test_data/projects including 2 INCEpTION importable projects as input for this surrogator pipeline
    1. edge case snippets with annotations
    2. GraSCCo with annotations

Step 1: Assurance step

Before replacing sensitive entities in the text with surrogates, we recommend conducting an assurance step / quality control step. This ensures that all sensitive entity annotations are accurately processed and appropriate surrogates can be generated. Some annotated entities may require manual inspection.

Categories Automatically Handled by Replacement Modes

The following categories are automatically processed by all replacement modes (see supported modes):

NAME (including all sub-categories)
DATE_BIRTH and DATE_DEATH (other DATE annotations are not prioritized during GeMTeX processing)
LOCATION
ID
CONTACT

Categories Requiring Manual Inspection

The following categories are summarized in a tabular structure and require manual review. In certain cases, it may be necessary to exclude a document from further processing:

AGE: Any age above 89 is not permissible.
PROFESSION: This category may contain sensitive information if the individual has an identifiable job or is a public figure (e.g., a mayor or minister).
OTHER: Requires review of the annotated document to ensure accuracy; annotations may need to be adjusted.
LOCATION_OTHER: This category may contain sensitive identifying information and should be carefully reviewed.
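
The review rules above can be expressed as a small check. The following is only a minimal sketch: the `Annotation` class and the function `review_reason` are hypothetical names introduced for illustration and are not part of the Surrogator code base.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Categories that the list above marks for manual inspection.
MANUAL_REVIEW_TYPES = {"PROFESSION", "OTHER", "LOCATION_OTHER"}
MAX_PERMISSIBLE_AGE = 89


@dataclass
class Annotation:
    """Hypothetical, simplified stand-in for one PII annotation."""
    kind: str  # annotation type, e.g. "AGE" or "NAME_PATIENT"
    text: str  # text covered by the annotation


def review_reason(annotation: Annotation) -> Optional[str]:
    """Return why an annotation needs manual review, or None if it does not."""
    if annotation.kind in MANUAL_REVIEW_TYPES:
        return f"{annotation.kind} requires manual inspection"
    if annotation.kind == "AGE":
        number = re.search(r"\d+", annotation.text)
        if number and int(number.group()) > MAX_PERMISSIBLE_AGE:
            return "age above 89 is not permissible"
        return "AGE requires manual inspection"
    return None


for item in (Annotation("AGE", "92"), Annotation("NAME_PATIENT", "Beate Albers")):
    print(item.kind, "->", review_reason(item))
```

Returning a reason string instead of a boolean keeps the information a reviewer needs when deciding whether to exclude a document.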

Examples of Lookups Using a Table Structure

Refer to the table structure with example GraSCCo annotations (→ test_data/export_curated_documents_v2.zip):

A list with corpus_details
A list with corpus documents
1. Document List: Lists all documents in the corpus.
2. Inclusion Toggle: Allows toggling documents between inclusion and exclusion from the corpus based on manually reviewed entities.
  - Documents marked with 1 are included in the corpus for further processing.
  - Documents with an OTHER annotation are automatically excluded and marked with 0. This value can be manually adjusted if a document should be re-included.
This table serves as the input for the subsequent surrogate step. It must be manually reviewed and adjusted as it determines which documents will proceed to the next processing stage and be part of the final corpus.

Example:

document	part_of_corpus
Stölzl.txt	1
Rieser.txt	1
...	...
Meyr.txt	0
Dewald.txt	1
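
Selecting the documents that proceed to the surrogate step can be sketched as follows. This is an illustration only: it assumes the table is available as tab-separated text with the two columns shown above, and the function name `included_documents` is hypothetical. The actual file format written by the quality control step may differ.

```python
import csv
import io


def included_documents(table_text: str) -> list[str]:
    """Return the names of all documents marked with part_of_corpus = 1."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return [
        row["document"]
        for row in reader
        if row["part_of_corpus"].strip() == "1"
    ]


# Rows taken from the example table above (the elided rows are left out).
example = (
    "document\tpart_of_corpus\n"
    "Stölzl.txt\t1\n"
    "Rieser.txt\t1\n"
    "Meyr.txt\t0\n"
    "Dewald.txt\t1\n"
)
print(included_documents(example))
# ['Stölzl.txt', 'Rieser.txt', 'Dewald.txt']
```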

A list with statistics
A quality control json file
A summary of all reports (.md file)

The output of a quality control run of a project is stored in a newly created directory like private/private-'timestamp-key-of-run'/'project-name'.

Step 2: Surrogate

This pipeline provides the following modes, each offering a distinct approach to replacing sensitive information with surrogates.

gemtex → suggested in GeMTeX
- Placeholder notation for preserving identity without using real names
  - Example:
    - Beate Albers → [** NAME_PATIENT FR7CR8 **]
      - NAME_PATIENT : entity
      - FR7CR8 : key
Wir berichten über lhre Patientin [** NAME_PATIENT FR7CR8 **] (* [** DATE_BIRTH 01.04.1997 **]), die sich vom 19.3. bis zum 7.5.2029 in unserer stat. Behandlung befand.
- This mode supports reversing the surrogate replacement process. Each replaced entity is assigned a unique key that stores the original value. These mappings are saved in a JSON file; see the example below.
  
  Note: This file is critical and must not be deleted, as it will be required in a later step.

      "Albers.txt": {
        "filename_orig": "Albers.txt",
        "annotations": {
          "NAME_PATIENT": {
            "WV7IT2": "Albers",
            "DU3DE3": "Beate Albers"
          },
          "DATE_BIRTH": {
            "01.04.1997": "4.4.1997"
          },
          "NAME_TITLE": {
            "EV2DL0": "Dr.med.",
            "AX9KF0": "Dr."
          },
          "NAME_DOCTOR": {
            "KS1EU0": "Siewert",
            "BW8TQ7": "Bernwart Schulze"
          }
        }
      },
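
Reversing the gemtex placeholders with such a mapping could look roughly like the sketch below. It assumes the placeholder notation `[** TYPE KEY **]` shown above and uses the `annotations` part of one document entry; the regular expression and the function name `restore` are assumptions made for illustration, not the Surrogator's own implementation.

```python
import re

# Matches the placeholder notation "[** TYPE KEY **]" used by mode gemtex.
PLACEHOLDER = re.compile(r"\[\*\* (?P<entity>[A-Z_]+) (?P<key>[^\s\]]+) \*\*\]")


def restore(text: str, annotations: dict) -> str:
    """Replace every placeholder by the original value stored for its key."""

    def lookup(match: re.Match) -> str:
        entity, key = match["entity"], match["key"]
        # Unknown keys are left untouched instead of raising an error.
        return annotations.get(entity, {}).get(key, match.group(0))

    return PLACEHOLDER.sub(lookup, text)


# Subset of the "annotations" object from the example mapping above.
annotations = {
    "NAME_PATIENT": {"DU3DE3": "Beate Albers"},
    "DATE_BIRTH": {"01.04.1997": "4.4.1997"},
}
surrogated = "Patientin [** NAME_PATIENT DU3DE3 **] (* [** DATE_BIRTH 01.04.1997 **])"
print(restore(surrogated, annotations))
# Patientin Beate Albers (* 4.4.1997)
```

Because the mapping is the only link back to the original values, losing the JSON file makes the replacement irreversible.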

fictive
- Surrogation via fictitious name replacements
  - Example:
    - Beate Albers → Tina Schmitd
      - NAME_PATIENT : entity
      - Tina Schmitd : key
Wir berichten über lhre Patientin Tina Schmitd (* 01.04.1997), die sich vom 12.3. bis zum 30.4.2029 in unserer stat. Behandlung befand.
- This mode supports reversing the surrogate replacement process. Each replaced entity is also assigned a unique key that stores the original value. These mappings are saved in a JSON file.
- For a detailed documentation of the replacement logic, see doc/Surrogator_Technical_Description.pdf
Note
- Every surrogate run includes a quality control step with all of its outputs.
- Documents with an OTHER annotation or a wrong annotation (marked as NONE) are excluded and not processed during the surrogate process!
The output of a run is stored in 2 locations:
- public files of a project are stored in a newly created directory like public/public-'timestamp-key-of-run'/'project-name'
  - all newly created text files
- private files of a project are stored in a newly created directory like private/private-'timestamp-key-of-run'/'project-name'.
  - a directory with quality control of a run
  - a directory with cas files

Configuration and Run

Quick entry with End-to-End example

Preparation

Install Python 3.11.
Using a virtual environment is recommended.
Install the following packages via pip (see pyproject.toml), or run pip install .

        pandas~=2.2.2
        dkpro-cassis
        pycaprio~=0.3.0
        streamlit
        toml~=0.10.2
        mdutils~=1.6.0
        tabulate~=0.9.0
        streamlit_ext
        python-dateutil~=2.9.0
        requests~=2.32.3
        schwifty
        gender-guesser
        spacy
        joblib
        sentence-transformers
        Levenshtein
        scikit-learn
        openpyxl
        overpy
        anytree

Usage with Docker
- build the image: sudo docker build -t gemtex/surrogator:0.3.0 .
- list the images: sudo docker images
- start the containers: sudo docker compose -f docker-compose.yml up

Local Usage

Input: zipped and curated INCEpTION annotation projects in 1 directory with GeMTeX PII annotations, example: test_data/projects

Run Step 1: task `quality_control`

Run: python surrogator.py -qc -p path_to_projects or
Run: python surrogator.py --quality_control -p path_to_projects
Example: python surrogator.py -qc -p test_data/projects
Local run in a terminal: python surrogator.py configs/parameters_quality_control.conf

The output is stored in (created) directories:

private : archive @ Data Integration Center; for every run a private directory is created, containing
- the newly created cas files in cas-project_name-timestamp_key
- a directory with statistics of the quality control output
- for modes gemtex and fictive, 2 json files with the mapping of the original PII to the surrogated PII:
  - nested version: 'common' json formatted file, example:

        "beruf_einrichtung.txt": {
          "filename_orig": "beruf_einrichtung.txt",
          "annotations": {
            "NAME_PATIENT": {
              "HP7SL6": "Andreas Fleischmann"
            },
            "LOCATION_ORGANIZATION": {
              "PE8QX5": "Schlachhof Schlacht-Gut"
            }
          }
        },

  - flat version: no nesting, table formatted, better input for Pseudonym Management Tools, example:

        "project-deid-test-data-1-2025-09-18-160813.zip-**-beruf_einrichtung.txt-**-NAME_PATIENT-**-HP7SL6": "Andreas Fleischmann",
        "project-deid-test-data-1-2025-09-18-160813.zip-**-beruf_einrichtung.txt-**-LOCATION_ORGANIZATION-**-PE8QX5": "Schlachhof Schlacht-Gut",
        "project-deid-test-data-1-2025-09-18-160813.zip-**-contact.txt-**-LOCATION_CITY-**-GT9GP1": "Leipzig",

public : for further usage (research, LLM training, semantic annotation, ...)
- only the newly generated text files from the projects
Examples: test_data with private and public directories
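
The relation between the nested and the flat mapping files described above can be illustrated with a short sketch. The key layout (project archive, document, type and key joined by `-**-`) is inferred from the flat example; the function name `flatten` is hypothetical and the Surrogator's own code may build the file differently.

```python
SEPARATOR = "-**-"


def flatten(project: str, nested: dict) -> dict:
    """Turn a nested per-document mapping into a flat key/value table."""
    flat = {}
    for document, entry in nested.items():
        for entity, keys in entry["annotations"].items():
            for key, original in keys.items():
                flat_key = SEPARATOR.join((project, document, entity, key))
                flat[flat_key] = original
    return flat


nested = {
    "beruf_einrichtung.txt": {
        "filename_orig": "beruf_einrichtung.txt",
        "annotations": {
            "NAME_PATIENT": {"HP7SL6": "Andreas Fleischmann"},
            "LOCATION_ORGANIZATION": {"PE8QX5": "Schlachhof Schlacht-Gut"},
        },
    }
}
project = "project-deid-test-data-1-2025-09-18-160813.zip"
for flat_key, original in flatten(project, nested).items():
    print(flat_key, "->", original)
```

A flat table of this kind has one row per replaced entity, which is presumably why it is the better input for pseudonym management tools.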

Run Step 2: task `surrogate`

Run with mode x python surrogator.py -x -p path_to_projects
Run with mode entity python surrogator.py -e -p path_to_projects
Run with mode gemtex python surrogator.py -g -p path_to_projects
Run with mode fictive python surrogator.py -f -p path_to_projects
- NOTE: if you want all DATE annotations (incl. DATE_BIRTH and DATE_DEATH) to be shifted, add the option -d with an integer value.
- Example: python surrogator.py -f -p path_to_projects -d 7 shifts the dates by seven days.
- If no shift is defined, no dates are shifted, and DATE_BIRTH and DATE_DEATH are replaced by the first day of the respective quarter.
- If a date cannot be processed, it is replaced by the surrogate DATE.
NOTE: if there is a UIMA Cas file with annotations in your project path, files will be processed separately.
- example: python surrogator.py -f -p path_to_projects (see test_data/grascco_examples)
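
The date handling described above can be sketched as follows. This is an assumption-laden illustration, not the Surrogator's implementation: it only understands dates written as day.month.year, and the function name `surrogate_date` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

DATE_FORMAT = "%d.%m.%Y"  # assumed input format, e.g. "4.4.1997"


def surrogate_date(text: str, shift_days: Optional[int] = None,
                   quarter_start: bool = False) -> str:
    """Shift a date, or map it to the first day of its quarter.

    quarter_start models the treatment of DATE_BIRTH and DATE_DEATH
    when no shift is given. Unparseable input yields the surrogate "DATE".
    """
    try:
        parsed = datetime.strptime(text.strip(), DATE_FORMAT).date()
    except ValueError:
        return "DATE"
    if shift_days is not None:
        parsed += timedelta(days=shift_days)
    elif quarter_start:
        first_month = 3 * ((parsed.month - 1) // 3) + 1
        parsed = parsed.replace(month=first_month, day=1)
    return parsed.strftime(DATE_FORMAT)


print(surrogate_date("4.4.1997", quarter_start=True))  # 01.04.1997
print(surrogate_date("4.4.1997", shift_days=7))        # 11.04.1997
print(surrogate_date("im Frühjahr"))                   # DATE
```

The first call reproduces the DATE_BIRTH entry from the gemtex mapping example, where 4.4.1997 is stored under the key 01.04.1997.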

Run via Webservice

Run: python surrogator.py -ws or
Run: python surrogator.py --webservice
See Readme_Webservice.md for more details on using the web service.

Remote Usage via Webservice (API Mode of Webservice)

Download the INCEpTION annotation platform.
Extend settings.properties with remote-api.enabled=true and follow the instructions for ROLE_REMOTE in the INCEpTION admin guide.
Start INCEpTION.
Configure an INCEpTION project with users and documents, add ROLE_REMOTE in your INCEpTION project(s).
Remote usage also works with an INCEpTION instance running locally.

Docker Deployment

General Docker Setup

To deploy the application, docker-compose or a docker binary modern enough to support the sub-command 'compose' is required. The Docker setup consists of two containers:

gemtexsurrogator: The application itself
overpass_api: A local OpenStreetMap (OSM) server, which allows querying OSM maps without leaking information about the raw data to the internet.

Docker Sizing

50 GB Disk Space: Mostly for OSM map data (usage peaks during the initial import) and for the models integrated in the Docker image.
8 GB RAM: Mostly needed during the initial data load of OSM.
2 CPU cores: The applications are mostly single-threaded but will profit from a second core during database indexing.
3 hours setup time: The initial load of the Docker images and the map data takes about 10 min with a 1 GBit/s internet connection. After that, the OSM container runs a process called 'update_database' for multiple hours, followed by a run of 'osm3s_query', to import and index the data for later queries.

Docker Deployment

To deploy with Docker, follow these steps:

Build the image for the application container and note down the final image ID for tagging:

    $ docker build .
    
    => => writing image sha256:a429b43516db046d8e1a6ba5d8da46ebd6c4af1a85bdf983c4a2c017fb6a7b89

Tag the image with the name used in your docker-compose.yml, e.g.:

    $ grep image docker-compose.yml 
        image: gemtex/surrogator:0.3.0
        image: wiktorn/overpass-api:latest
    $ docker tag a429b43516db046d8e1a6ba5d8da46ebd6c4af1a85bdf983c4a2c017fb6a7b89 gemtex/surrogator:0.3.0

Run the containers:

    $ docker-compose up -d

On the first start, the overpass-api will download and index its database. This will take about 3 hours. On later restarts, it will only re-index its database, which takes about 10 min. The application should be ready and operate at full speed once you see a message like this one:

It took xxx to run the loop. Desired load is: 1%. Sleeping: yyyyy

To stop the application, go again to the folder containing the docker-compose.yml and run:

    docker-compose stop

Docker air-gapped setup

The above setup should also work for an air-gapped setup (to be confirmed). However, the images must be downloaded/pulled in advance on a system with internet access. To build the gemtex/surrogator image, execute docker build . as described above. To pull the overpass-api image, run:

    docker pull wiktorn/overpass-api:latest

Now both images can be saved like this:

    docker save wiktorn/overpass-api:latest | gzip -c >overpass_image.tgz
    docker save gemtex/surrogator | gzip -c > gemtex_surrogater_image.tgz

The size of the images will be around 20 GByte in total. Both images can be transferred to the air-gapped system by any means available and loaded there using the docker load < ...tgz command.

Also, the initial loading of the maps folder cannot work in the air-gapped environment. The maps folder must therefore be copied from a site with internet access after the initial load has completed (about 3 hours). The maps folder has a size of about 16 GB.

Now the images can be tagged on the target system and docker-compose works as described above.

Further Notes

The processing is logged. Log files are stored in the 'log' directory.
The directory resources contains resources used during processing. Do not delete this directory. It is filled during installation by loading the language models (SentenceTransformers and a spaCy-based model).

More information about the GeMTeX's de-identification

Lohr C, Matthies F, Faller J, Modersohn L, Riedel A, Hahn U, Kiser R, Boeker M, Meineke F. De-Identifying GRASCCO - A Pilot Study for the De-Identification of the German Medical Text Project (GeMTeX) Corpus. Stud Health Technol Inform. 2024 Aug 30;317:171-179. doi: 10.3233/SHTI240853. PMID: 39234720.
Lohr C, Faller J, Riedel A, Nguyen HM, Wolfien M, Hofenbitzer J, Modersohn L, Romberg J, Prasser F, Omeirat J, Wen Y, Galusch O, Hahn U, Seiferling M, Dieterich C, Klügl P, Matthies F, Kind J, Boeker M, Löffler M, Meineke F. GeMTeX's De-Identification in Action: Lessons Learned & Devil's Details. Stud Health Technol Inform. 2025 Sep 3;331:274-282. doi: 10.3233/SHTI251406. PMID: 40899551.

Contact

If you have further questions, do not hesitate to contact Christina Lohr.
