This is the Surrogator, a Python-based framework designed to enhance privacy in German-language clinical text documents by replacing pre-annotated and pre-processed sensitive information with privacy-preserving placeholders.
The Surrogator relies on pre-annotated PII (Personally Identifiable Information) and was developed during the GeMTeX project. The input consists of PII annotations created with the INCEpTION annotation platform and stored in the UIMA CAS format. UIMA CAS is handled by DKPro-cassis and PyCaprio.
Currently, the pipeline is designed to generate placeholders / surrogates for these specific categories automatically. Any remaining types of sensitive information are addressed manually during a subsequent quality control step.
The annotation scheme used for the PII is based on the GeMTeX de-identification type system (annotation layer):
- NAME
  - NAME_PATIENT
  - NAME_RELATIVE
  - NAME_DOCTOR
  - NAME_EXT
  - NAME_USERNAME
  - NAME_TITLE
- DATE
  - DATE_BIRTH
  - DATE_DEATH
- AGE
- LOCATION
  - LOCATION_STREET
  - LOCATION_CITY
  - LOCATION_ZIP
  - LOCATION_COUNTRY
  - LOCATION_STATE
  - LOCATION_HOSPITAL
  - LOCATION_ORGANIZATION
  - LOCATION_OTHER
- ID
- CONTACT
  - CONTACT_PHONE
  - CONTACT_EMAIL
  - CONTACT_FAX
  - CONTACT_URL
- PROFESSION
- OTHER
In alignment with the Datenschutz-Konzept of the Medizininformatik-Initiative, there is a specific focus on the following types of sensitive information:
- Names
- Date of Birth
- Date of Death
- Address details
- Identifiers (e.g., insurance numbers, patient IDs from the hospital information system)
- Pre-Tagging: Clinical documents are first pre-annotated by an automated system marking potential PII.
- Manual Annotation and Curation: A 2+1 review process is applied within the INCEpTION annotation platform. Two annotators independently review and correct the pre-annotations. Subsequently, a curator harmonizes the two versions, resolves discrepancies, and ensures the final, quality-assured annotation.
- Export: The completed annotation projects are exported from INCEpTION in the `UIMA CAS JSON` format. This format contains both the original text and the associated PII annotations with their types (e.g., `NAME_PATIENT`, `DATE_BIRTH`) and exact positions in the text.
- The annotations from the de-identification process, along with their corresponding curations, are required.
- Export the annotations of the full annotation project (adapted from the inception-reporting-dashboard; do not use the Curation Export Mode) and ensure the format is set to `UIMA CAS JSON`.
- The command-line and experimental modes allow the input of individual `UIMA CAS JSON` files.
- Example directory with 2 test projects: test_data
  - edge case examples as plain text: test_data/deid-test-doc
  - test_data/projects, including 2 INCEpTION-importable projects as input for this Surrogator pipeline:
    - edge case snippets with annotations
    - GraSCCo with annotations
Before replacing sensitive entities in the text with surrogates, we recommend conducting an assurance step / quality control step. This ensures that all sensitive entity annotations are accurately processed and appropriate surrogates can be generated. Some annotated entities may require manual inspection.
The following categories are automatically processed by all replacement modes (see supported modes):
- `NAME` (including all sub-categories)
- `DATE_BIRTH` and `DATE_DEATH` (other `DATE` annotations are not prioritized during GeMTeX processing)
- `LOCATION`
- `ID`
- `CONTACT`
The following categories are summarized in a table and require manual review. In certain cases, it may be necessary to exclude a document from further processing:
- `AGE`: Any age above 89 should not be permissible.
- `PROFESSION`: This category may contain sensitive information if the individual has an identifiable job or is a public figure (e.g., a mayor or minister).
- `OTHER`: Requires review of the annotated document to ensure accuracy; annotations may need to be adjusted.
- `LOCATION_OTHER`: This category may contain sensitive identifying information and should be carefully reviewed.
Refer to the table structure with example GraSCCo annotations (→ test_data/export_curated_documents_v2.zip):
- A list with corpus_details
- A list with corpus documents
  - Document List: Lists all documents in the corpus.
  - Inclusion Toggle: Allows toggling documents between inclusion and exclusion from the corpus based on manually reviewed entities.
    - Documents marked with `1` are included in the corpus for further processing.
    - Documents with an `OTHER` annotation are automatically excluded and marked with `0`. This value can be manually adjusted if a document should be re-included.
This table serves as the input for the subsequent surrogate step. It must be manually reviewed and adjusted as it determines which documents will proceed to the next processing stage and be part of the final corpus.
Example:
| document | part_of_corpus |
|---|---|
| Stölzl.txt | 1 |
| Rieser.txt | 1 |
| ... | ... |
| Meyr.txt | 0 |
| Dewald.txt | 1 |
- A list with statistics
- Quality control json file
- Summary with all reports (.md file)
The output of a quality control run for a project is stored in a newly created directory like private/private-'timestamp-key-of-run'/'project-name'.
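The inclusion flag in the document table shown above can also be evaluated programmatically before the surrogate step. The following is a minimal sketch, assuming the table has been saved as CSV text with the columns `document` and `part_of_corpus`. The CSV layout, the helper name `included_documents`, and the constant `EXAMPLE_TABLE` are assumptions made for illustration and are not part of the Surrogator itself.

```python
import csv
import io


def included_documents(csv_text: str) -> list[str]:
    """Return the names of all documents whose part_of_corpus flag is 1."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["document"]
        for row in reader
        if row["part_of_corpus"].strip() == "1"
    ]


# Hypothetical CSV rendering of the example table (assumed layout).
EXAMPLE_TABLE = """document,part_of_corpus
Stölzl.txt,1
Rieser.txt,1
Meyr.txt,0
Dewald.txt,1
"""

# Meyr.txt is marked with 0 and is therefore left out.
print(included_documents(EXAMPLE_TABLE))
```

A check like this only reads the flag; the manual review of the table remains the step that decides which documents stay in the corpus.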
This pipeline provides the following modes, each offering a distinct approach to replacing sensitive information with surrogates.
- `gemtex` → suggested in GeMTeX
  - Placeholder notation for preserving identity without using real names
  - Example: `Beate Albers` → `[** NAME_PATIENT FR7CR8 **]`
    - `NAME_PATIENT`: entity
    - `FR7CR8`: key
  - Example: `Wir berichten über Ihre Patientin [** NAME_PATIENT FR7CR8 **] (* [** DATE_BIRTH 01.04.1997 **]), die sich vom 19.3. bis zum 7.5.2029 in unserer stat. Behandlung befand.`
  - This mode supports reversing the surrogate replacement process. Each replaced entity is assigned a unique key that stores the original value. These mappings are saved in a `JSON` file (example below). Note: This file is critical and must not be deleted, as it will be required in a later step.
"Albers.txt": {
"filename_orig": "Albers.txt",
"annotations": {
"NAME_PATIENT": {
"WV7IT2": "Albers",
"DU3DE3": "Beate Albers"
},
"DATE_BIRTH": {
"01.04.1997": "4.4.1997"
},
"NAME_TITLE": {
"EV2DL0": "Dr.med.",
"AX9KF0": "Dr."
},
"NAME_DOCTOR": {
"KS1EU0": "Siewert",
"BW8TQ7": "Bernwart Schulze"
}
}
},
- `fictive`
  - Surrogation via fictitious name replacements
  - Example: `Beate Albers` → `Tina Schmitd`
    - `NAME_PATIENT`: entity
    - `Tina Schmitd`: key
  - Example: `Wir berichten über Ihre Patientin Tina Schmitd (* 01.04.1997), die sich vom 12.3. bis zum 30.4.2029 in unserer stat. Behandlung befand.`
  - This mode supports reversing the surrogate replacement process. Each replaced entity is also assigned a unique key that stores the original value. These mappings are saved in a `JSON` file.
  - For detailed documentation of the replacement logic, see doc/Surrogator_Technical_Description.pdf
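For the `gemtex` mode, the placeholder notation and the key-to-original mapping described above can be illustrated with a short sketch. It assumes annotations are given as `(begin, end, entity, key)` tuples with character offsets and that the mapping has the `{entity: {key: original}}` shape of the `annotations` object in the JSON example. The function names `insert_placeholders` and `restore_originals` and the helper `span` are hypothetical and do not describe the Surrogator's actual implementation.

```python
import re

# Matches the placeholder notation, e.g. [** NAME_PATIENT FR7CR8 **]
PLACEHOLDER = re.compile(r"\[\*\* (?P<entity>[A-Z_]+) (?P<key>[^\s\]]+) \*\*\]")


def insert_placeholders(text, annotations):
    """Replace annotated spans with [** ENTITY KEY **] placeholders.

    annotations: iterable of (begin, end, entity, key) tuples
    (assumed input structure, character offsets into text).
    """
    # Work from the end of the text so that earlier offsets stay valid.
    for begin, end, entity, key in sorted(annotations, reverse=True):
        text = f"{text[:begin]}[** {entity} {key} **]{text[end:]}"
    return text


def restore_originals(text, mapping):
    """Reverse the replacement using an {entity: {key: original}} mapping."""
    def lookup(match):
        originals = mapping.get(match["entity"], {})
        # Leave the placeholder untouched if the key is unknown.
        return originals.get(match["key"], match.group(0))
    return PLACEHOLDER.sub(lookup, text)


def span(text, fragment):
    """Hypothetical helper: offsets of the first occurrence of fragment."""
    begin = text.index(fragment)
    return begin, begin + len(fragment)


ORIGINAL = "Patientin Beate Albers (* 4.4.1997)"
ANNOTATIONS = [
    (*span(ORIGINAL, "Beate Albers"), "NAME_PATIENT", "DU3DE3"),
    (*span(ORIGINAL, "4.4.1997"), "DATE_BIRTH", "01.04.1997"),
]
MAPPING = {
    "NAME_PATIENT": {"DU3DE3": "Beate Albers"},
    "DATE_BIRTH": {"01.04.1997": "4.4.1997"},
}

SURROGATED = insert_placeholders(ORIGINAL, ANNOTATIONS)
print(SURROGATED)
print(restore_originals(SURROGATED, MAPPING))
```

Replacing from the last annotation to the first avoids recomputing offsets after each substitution, since a placeholder usually differs in length from the text it replaces.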
Note
- Every surrogate run includes a quality control step and produces all of its outputs.
- Documents with an `OTHER` annotation or a wrong annotation (marked as `NONE`) are excluded and not processed during the surrogate process!
The output of a run is stored in 2 ways:
- public files of a project are stored in a newly created directory like `public/public-'timestamp-key-of-run'/'project-name'`
  - all newly created text files
- private files of a project are stored in a newly created directory like `private/private-'timestamp-key-of-run'/'project-name'`
  - a directory with quality control of a run
  - a directory with cas files
- Quick entry with End-to-End example
- Install Python 3.11.
  - Using a virtual environment is recommended.
- Install the following packages via pip (see pyproject.toml), or run `pip install .`
pandas~=2.2.2
dkpro-cassis
pycaprio~=0.3.0
streamlit
toml~=0.10.2
mdutils~=1.6.0
tabulate~=0.9.0
streamlit_ext
python-dateutil~=2.9.0
requests~=2.32.3
schwifty
gender-guesser
spacy
joblib
sentence-transformers
Levenshtein
scikit-learn
openpyxl
overpy
anytree
- Usage with Docker
  - run `sudo docker build -t gemtex/surrogator:0.3.0 .`
  - see images: `sudo docker images`
  - run `sudo docker compose -f docker-compose.yml up`
- Input: zipped and curated INCEpTION annotation projects in 1 directory with GeMTeX PII annotations, example: test_data/projects
- Run: `python surrogator.py -qc -p path_to_projects` or
- Run: `python surrogator.py --quality_control -p path_to_projects`
- Run quality control on the test data: `python surrogator.py -qc -p test_data/projects`
- Local run in a terminal: `python surrogator.py configs/parameters_quality_control.conf`
The output is stored in (created) directories:
- private: archive @ Data Integration Center. For every run a private directory is created, containing
  - the newly created cas files in cas-project_name-timestamp_key
  - a directory with statistics of quality control output
  - for the modes gemtex and fictive, 2 json files with the mapping of original PII and the surrogated PII:
    - nested version: 'common' json formatted file, example:

          "beruf_einrichtung.txt": {
            "filename_orig": "beruf_einrichtung.txt",
            "annotations": {
              "NAME_PATIENT": {
                "HP7SL6": "Andreas Fleischmann"
              },
              "LOCATION_ORGANIZATION": {
                "PE8QX5": "Schlachhof Schlacht-Gut"
              }
            }
          },

    - flattened version: no nesting, table formatted, better input for Pseudonym Management Tools, example:

          "project-deid-test-data-1-2025-09-18-160813.zip-**-beruf_einrichtung.txt-**-NAME_PATIENT-**-HP7SL6": "Andreas Fleischmann",
          "project-deid-test-data-1-2025-09-18-160813.zip-**-beruf_einrichtung.txt-**-LOCATION_ORGANIZATION-**-PE8QX5": "Schlachhof Schlacht-Gut",
          "project-deid-test-data-1-2025-09-18-160813.zip-**-contact.txt-**-LOCATION_CITY-**-GT9GP1": "Leipzig",

- public: for further usage (research, LLM training, semantic annotation, ...)
  - only newly generated text files from the projects
- test_data with `private` and `public`
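The relation between the nested and the flattened mapping file can be sketched as follows. This is an illustration under assumptions, not the Surrogator's own code: the separator `-**-` and the key order (project archive, document, entity, key) are read off the example above, and the function name `flatten_mapping` is hypothetical.

```python
import json

# Separator as it appears in the flattened example keys (assumption).
SEPARATOR = "-**-"


def flatten_mapping(project: str, nested: dict) -> dict:
    """Turn {document: {"annotations": {entity: {key: original}}}}
    into a flat {joined_key: original} dictionary."""
    flat = {}
    for document, entry in nested.items():
        for entity, keys in entry["annotations"].items():
            for key, original in keys.items():
                flat_key = SEPARATOR.join((project, document, entity, key))
                flat[flat_key] = original
    return flat


PROJECT = "project-deid-test-data-1-2025-09-18-160813.zip"
NESTED = {
    "beruf_einrichtung.txt": {
        "filename_orig": "beruf_einrichtung.txt",
        "annotations": {
            "NAME_PATIENT": {"HP7SL6": "Andreas Fleischmann"},
            "LOCATION_ORGANIZATION": {"PE8QX5": "Schlachhof Schlacht-Gut"},
        },
    }
}

FLAT = flatten_mapping(PROJECT, NESTED)
print(json.dumps(FLAT, indent=2, ensure_ascii=False))
```

A flat dictionary of this kind maps directly onto a two-column table (pseudonym key, original value), which is presumably why it is the more convenient input for pseudonym management tools.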
- Run with mode x: `python surrogator.py -x -p path_to_projects`
- Run with mode entity: `python surrogator.py -e -p path_to_projects`
- Run with mode gemtex: `python surrogator.py -g -p path_to_projects`
- Run with mode fictive: `python surrogator.py -f -p path_to_projects`
  - NOTE: If you want all `DATE` annotations (incl. `DATE_BIRTH` and `DATE_DEATH`) to be shifted, use the option `-d` with an integer value,
  - example: `python surrogator.py -f -p path_to_projects -d 7` for a shift of seven days.
  - If no shift is defined, no shift is applied, and `DATE_BIRTH` and `DATE_DEATH` are set to the first day of the quarter.
  - If a date cannot be processed, the surrogate replacement is `DATE`.
- NOTE: If there is a `UIMA CAS` file with annotations in your project path, the files will be processed separately.
  - example: `python surrogator.py -f -p path_to_projects` (see `test_data/grascco_examples`)
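The date handling described for the `-d` option can be sketched in a few lines. This is a simplified illustration under assumptions: it only parses dates written as day.month.year (e.g. `4.4.1997`), whereas clinical text contains many more date formats, and the function names `first_day_of_quarter` and `surrogate_date` are hypothetical rather than taken from the Surrogator.

```python
from datetime import date, datetime, timedelta
from typing import Optional


def first_day_of_quarter(value: date) -> date:
    """Return the first day of the quarter that contains value."""
    first_month = 3 * ((value.month - 1) // 3) + 1
    return date(value.year, first_month, 1)


def surrogate_date(raw: str, shift_days: Optional[int] = None) -> str:
    """Sketch of the described behaviour for DATE_BIRTH / DATE_DEATH.

    - unparsable input      -> the literal surrogate "DATE"
    - shift_days is given   -> date shifted by that many days
    - no shift is given     -> first day of the quarter
    """
    try:
        # Assumption: only the day.month.year format is handled here.
        parsed = datetime.strptime(raw.strip(), "%d.%m.%Y").date()
    except ValueError:
        return "DATE"
    if shift_days is None:
        result = first_day_of_quarter(parsed)
    else:
        result = parsed + timedelta(days=shift_days)
    return result.strftime("%d.%m.%Y")


print(surrogate_date("4.4.1997"))        # no shift: first day of the quarter
print(surrogate_date("4.4.1997", 7))     # shifted by seven days
print(surrogate_date("Frühjahr 1997"))   # not processable
```

Without a shift, the sketch maps `4.4.1997` to `01.04.1997`, which matches the `DATE_BIRTH` entry in the mapping example of the `gemtex` mode.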
- Run: `python surrogator.py -ws` or
- Run: `python surrogator.py --webservice`
- More details on the usage of the Web Service can be found here.
- Download the INCEpTION annotation platform.
- Extend `settings.properties` with `remote-api.enabled=true` and follow the instructions for `ROLE_REMOTE` in the INCEpTION admin guide.
- Start INCEpTION.
- Configure an INCEpTION project with users and documents, and add `ROLE_REMOTE` in your INCEpTION project(s).
- The remote usage also works locally.
To deploy the application, either docker-compose or a docker binary that is modern enough to support the sub-command 'compose' is required. The docker setup consists of two containers:
- gemtexsurrogator: The application itself
- overpass_api: A local OpenStreetMap (OSM) server, which allows to query OSM maps without leaking information about the raw data to the internet.
- 50 GB Disk Space: Mostly for OSM map data, peaks during initial import and for the models integrated in the docker image.
- 8 GB RAM: Mostly during initial data load of OSM.
- 2 CPU cores: The applications are mostly single-threaded but will profit from a second core during database indexing.
- 3 hours setup time: The initial load of the Docker images and the map data takes about 10 min with a 1 GBit/s internet connection. After that, the OSM container runs a process called 'update_database' for multiple hours, followed by a run of 'osm3s_query', to import and index the data for later queries.
To deploy with docker do this:
- Build the image for the application container and note down the final image ID for tagging:
$ docker build .
=> => writing image sha256:a429b43516db046d8e1a6ba5d8da46ebd6c4af1a85bdf983c4a2c017fb6a7b89
- Tag the image with the name used in your docker-compose.yml, e.g.:
$ grep image docker-compose.yml
image: gemtex/surrogator:0.3.0
image: wiktorn/overpass-api:latest
$ docker tag a429b43516db046d8e1a6ba5d8da46ebd6c4af1a85bdf983c4a2c017fb6a7b89 gemtex/surrogator:0.3.0
- Run the containers:
$ docker-compose up -d
On the first start, the overpass-api will download and index its database. This will take about 3 hours. On later restarts, it will only re-index its database for about 10 min. The application should be ready and operate at full speed once you see a message like this one:
It took xxx to run the loop. Desired load is: 1%. Sleeping: yyyyy
To stop the application, go again to the folder containing the docker-compose.yml and run:
docker-compose stop
The above setup should also work for an air-gapped setup (tbc). However,
the images must be downloaded/pulled in advance on a system with
internet access. To build the gemtex/surrogator just execute
docker build . as described above. To pull overpass-api image run:
docker pull wiktorn/overpass-api:latest
Now both images can be saved like this:
docker save wiktorn/overpass-api:latest | gzip -c >overpass_image.tgz
docker save gemtex/surrogator | gzip -c > gemtex_surrogater_image.tgz
The size of the images will be around 20 GByte in total. Both images can
be transferred to the air-gapped system by any means available and
loaded there using the docker load < ...tgz command.
Also, the initial loading of the maps folder cannot work in the air-gapped environment. The maps folder must therefore be copied from a site with internet access after the initial load has completed (about 3 hours). The maps folder has a size of about 16 GB.
Now the images can be tagged on the target system and docker-compose
works as described above.
- The processing is logged. Log files are stored in the 'log' directory.
- The directory resources contains resources used during processing. Do not delete this directory. It is filled during installation by loading language models (SentenceTransformers and a spaCy-based model).
- Lohr C, Matthies F, Faller J, Modersohn L, Riedel A, Hahn U, Kiser R, Boeker M, Meineke F. De-Identifying GRASCCO - A Pilot Study for the De-Identification of the German Medical Text Project (GeMTeX) Corpus. Stud Health Technol Inform. 2024 Aug 30;317:171-179. doi: 10.3233/SHTI240853. PMID: 39234720.
- Lohr C, Faller J, Riedel A, Nguyen HM, Wolfien M, Hofenbitzer J, Modersohn L, Romberg J, Prasser F, Omeirat J, Wen Y, Galusch O, Hahn U, Seiferling M, Dieterich C, Klügl P, Matthies F, Kind J, Boeker M, Löffler M, Meineke F. GeMTeX's De-Identification in Action: Lessons Learned & Devil's Details. Stud Health Technol Inform. 2025 Sep 3;331:274-282. doi: 10.3233/SHTI251406. PMID: 40899551.
If you have further questions, do not hesitate to contact Christina Lohr.
