This repository accompanies the manuscript “Streamlining the Histopathological Workflow in Diabetic Kidney Disease with Artificial Intelligence” and provides the source code for:
- Glomerulus detection: Instance segmentation with Mask R-CNN to localize and extract glomeruli from PAS-stained whole-slide images.
- Annotation pipeline: Templates and helper functions to launch, monitor, and retrieve annotation jobs on AWS, reducing study evaluation turnaround by up to 80%.
- Automated DKD scoring: AI-based classifiers for semi-quantitative glomerular grading, achieving expert-level performance and cutting turnaround time by up to 90%.
- Self-supervised learning: Pretraining on unlabeled preclinical data to improve robustness and mitigate expert bias.
- Translation to human biopsies: Unsupervised Feature Translation (UFT) adapts mouse-trained features to human tissue at inference, reducing the translational gap without human labels.
Build the Docker image using the Dockerfile in `docker_image/Dockerfile`:

```bash
cd docker_image
docker build -f Dockerfile -t kidneyaidocker \
  --build-arg UID=$(id -u) \
  --build-arg GID=$(id -g) \
  --build-arg USER=$(whoami) \
  --build-arg GROUP=$(id -g -n) .
```

- Image name: `kidneyaidocker`
- Context: current directory
- Args: match the host UID/GID for smooth file permissions
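Once built, the container can be started with a standard `docker run` invocation. The mount points, the GPU flag, and the `bash` command below are illustrative assumptions and should be adapted to your setup:

```bash
# Sketch only: mount the data directory and the repo, and enable GPUs if the
# NVIDIA container toolkit is installed (drop --gpus otherwise).
docker run --rm -it \
  --gpus all \
  -v /path_to_data:/path_to_data \
  -v "$(pwd)":/workspace \
  kidneyaidocker bash
```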
Use `kidneyai_detection.py` to train a detector or extract glomeruli from a study.
- Default config: `config_files/Detection.json`
- Override config: `--params_path`
- Help: `python kidneyai_detection.py --help`
Key switches in the config (see the sketch after this list):
- `detection_set_generation: true`
  Preprocess and build a training set per `detection_set_generation_params`. Creates `dkd_detection_set` in the main data directory.
- `unnanotated_patch_generation: true`
  Tile each slide per `unnanotated_patch_generation_params` (for unlabeled WSIs).
- `train_detection_model: true`
  Train on the generated detection set. Configure with `model_params`, `dataloader_params`, `dataset_params`, `optimization_params`, `transfer_learning_params`, `log_params`. Training recipe: `train_detection_model_params`.
- `detect_on_new_study: true`
  Run detection on a study, configured with `detect_on_new_study_params`.
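A typical run might look like the following; the copied config name is a placeholder, and the switches above are toggled inside the JSON rather than on the command line:

```bash
# Inspect available options
python kidneyai_detection.py --help

# Run with the default config (config_files/Detection.json)
python kidneyai_detection.py

# Or copy the default config, enable the switches you need
# (e.g., detection_set_generation, train_detection_model), and point to it
cp config_files/Detection.json config_files/MyDetection.json
python kidneyai_detection.py --params_path config_files/MyDetection.json
```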
Prepare datasets for AWS annotation using `kidneyai_detection.py`.
- Default config: `config_files/Detection.json`
- Enable AWS flow: `general_params -> what_to_run -> interact_with_aws: true`
- Control logic: `aws_interactions`

Key switches:
- `create_annotation_patches -> apply: true`
  Crops and saves the glomeruli detected in the detection step. The remaining fields define what and how to crop.
- `upload_data_to_s3 -> apply: true`
  Uploads the cropped data to S3. The remaining fields define targets and filters.

Notebooks:
- Start an annotation job: `Create_Annotation_job.ipynb`
- Collect results: `Collect_responses_from_AWS.ipynb`
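A sketch of the end-to-end annotation flow, assuming AWS credentials are already configured and Jupyter is available in your environment (the config filename is a placeholder):

```bash
# 1) Crop detected glomeruli and upload them to S3
#    (interact_with_aws, create_annotation_patches and upload_data_to_s3 enabled in the config)
python kidneyai_detection.py --params_path config_files/MyDetection.json

# 2) Launch the annotation job, then later collect the responses
jupyter notebook Create_Annotation_job.ipynb
jupyter notebook Collect_responses_from_AWS.ipynb
```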
Train classifiers and infer DKD scores using `kidneyai_classification.py`.
- Default config: `config_files/Classification.json`
- Override config: `--params_path`
- Help: `python kidneyai_classification.py --help`
Predefined configs reproduce the supervised, self-supervised, transfer learning, and Unsupervised Feature Translation (UFT) experiments described in the paper.
Run `python kidneyai_classification.py` with one of the following flags:
- `--kfold`: k-fold train/test as defined in `k_fold_params`
- `--kfold_multi`: k-fold; evaluate against all annotators at test time
  - Multi-annotated slides must be marked as `multi` in `classification -> defaults -> fold.py` (line 314).
  - Slides marked `single` are training-only in this setup.
- To use a pretrained encoder (e.g., DINO), set `transfer_learning_params` accordingly.
DINO:

```bash
python kidneyai_classification.py --params_path config_files/SSL-DINO-mouse.json --dino
```

Other SSL methods:
- BYOL: add `--byol`
- SimSiam: add `--simsiam`
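For reference, the k-fold flags are passed directly to the script. The invocations below are a sketch; the SSL config name for BYOL is hypothetical and should be replaced with the one you actually use:

```bash
# Supervised k-fold train/test with the default config (config_files/Classification.json);
# folds are taken from k_fold_params.
python kidneyai_classification.py --kfold

# k-fold, evaluated against all annotators at test time
python kidneyai_classification.py --kfold_multi

# BYOL / SimSiam pretraining: same pattern as the DINO command above,
# with --byol or --simsiam (the config name below is hypothetical).
python kidneyai_classification.py --params_path config_files/SSL-BYOL-mouse.json --byol
```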
Three-step workflow:

1. Pretrain (SSL)

   ```bash
   python kidneyai_classification.py --params_path config_files/SSL-DINO-mouse.json --dino
   ```

2. Supervised head on a frozen SSL encoder (Dual Model)
   - In `config_files/DualModel.mouse.json`, set `model_params -> UFT -> foundation_params` to load the feature extractor.

   ```bash
   python kidneyai_classification.py --params_path config_files/DualModel.mouse.json --uft
   ```

3. Translate the encoder on the target domain and evaluate

   ```bash
   python kidneyai_classification.py --params_path config_files/UFT_translation.json --translation_kfold_multi
   ```

   - `--translation_kfold_multi`: update the encoder and evaluate across folds (`k_fold_params`)
   - `--translation_kfold_multi_full`: translate on the entire target dataset (single adapted model); also runs evaluation on the target dataset post-translation
Note: `data_defs -> labeled_pickles` should match the inference pickles.

Important settings:
- `training_params -> target_uft_cls_name`: which model to update
- `training_params -> model_name`: output name of the translated model
- `transfer_learning_params -> use_pretrained: true`, with `pretrained_method` set to the SSL method used (e.g., DINO). The pretrained weights should be those from Step 1.
Pseudo-label path: you can create pseudo labels and still use `--translation_kfold_multi`. Otherwise:
- Alternative 1: run DINO and set `transfer_learning_params -> pretrained_method: "UFT"`.

  ```bash
  python kidneyai_classification.py --params_path config_files/SSL-DINO-mouse.json --dino
  ```

- Alternative 2 (recommended): combine `--dino` and `--ssl_uft` to load the dual model and update only the feature extractor (see the sketch below).
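A rough sketch of the Alternative 2 invocation, combining the two flags documented above; the config path is an assumption borrowed from Step 3, so substitute the dual-model/translation config that matches your setup:

```bash
# Sketch only: --dino loads the SSL recipe, --ssl_uft restricts updates to the
# feature extractor of the dual model. The config file is a placeholder.
python kidneyai_classification.py --params_path config_files/UFT_translation.json --dino --ssl_uft
```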
After adaptation, run inference and save the predictions:

```bash
python kidneyai_classification.py --params_path config_files/DualModel.translated.json --uft --inference
```

Ensure the transfer-learning parameters for both the feature extractor and the classifier are consistent across all alternatives.
Organize your datasets as follows:
```
/path_to_data/DKD/
├─ study_1/
│  ├─ unannotated/                   # WSIs without labels
│  ├─ annotated/                     # WSIs with paired .xml annotations
│  ├─ additional_localizations/      # (optional; detection-only) .xml files for unannotated WSIs
│  └─ glomeruli/
│     ├─ glomeruli-patches/          # cropped glomeruli
│     ├─ glomeruli-refs/             # reference points on the WSI
│     ├─ glomerulus_patch_list.pickle
│     ├─ dkd_annotations_expert1.pickle
│     ├─ dkd_annotations_expert2.pickle
│     └─ dkd_annotations_expert3.pickle
└─ study_wa_ua/
   └─ ...
```
Notes:
- `glomerulus_patch_list.pickle` contains a list of dictionaries describing each crop (WSI location, image path, etc.). This is the pickle used for AWS uploads.
- Keep all DKD glomerular annotations in the study’s `glomeruli/` folder.
- In `additional_localizations/`, include only `.xml` files; WSIs in that folder are not auto-read.
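To sanity-check a study folder, you can peek at the patch-list pickle. This one-liner is a sketch: the path is illustrative, and it assumes the pickle holds plain Python dictionaries:

```bash
python -c "
import pickle, pprint
with open('/path_to_data/DKD/study_1/glomeruli/glomerulus_patch_list.pickle', 'rb') as f:
    patches = pickle.load(f)          # expected: a list of dicts, one per crop
print(len(patches), 'glomerulus patches')
pprint.pprint(patches[0])             # WSI location, image path, etc.
"
```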
Configuration parameters:

- `data_defs`
  - `labeled_pickles`: list of pickles with labeled data (one or more pickle filenames)
  - `unlabeled_pickles`: list of pickles with unlabeled data for self-supervision (one or more pickle filenames)
  - `semilabeled_pickles`: list of pickles with weakly annotated data for weak/self-supervised learning (one or more pickle filenames)
  - `inference_pickles`: list of pickles with data for DKD inference (one or more pickle filenames)
- `dataset_params`
  - `data_location`: datasets root directory
  - `train_transforms`: training augmentations
  - `val_transforms`: validation augmentations
  - `test_transforms`: test augmentations
- `dataloader_params`
  - standard DataLoader controls (batch size, num_workers, pin_memory, etc.)
- `model_params`
  - `backbone_type`: e.g., resnet50, deit_small
  - `transformers_params`:
    - `img_size`: input size
    - `patch_size`: transformer patch size
  - `pretrained_type`: "supervised" (ImageNet) or "dino" (ImageNet SSL)
  - `pretrained`: use ImageNet weights
  - `freeze_backbone`: freeze the encoder
  - `DINO`: hyperparameters for DINO training
- `optimization_params`
  - `optimizer`:
    - `type`: optimizer class
    - `autoscale_rl`: scale the learning rate by batch size
    - `params`: learning rate, weight decay, momentum/betas, etc.
    - `LARS_params`: use LARS if `use: true` and batch size ≥ `batch_act_thresh`
  - `scheduler`:
    - `type`: scheduler pipeline (list)
    - `params`: scheduler-specific hyperparameters
- `training_params`
  - `model_name`: the model's name
  - `val_every`: validation frequency (in epochs, float)
  - `log_every`: logging frequency (in iterations)
  - `save_best_model`: keep the best model by validation metric
  - `log_embeddings`: plot UMAPs at each validation
  - `knn_eval`: evaluate kNN metrics during validation
  - `grad_clipping`: a value > 0 enables gradient clipping
  - `use_mixed_precision`: enable AMP
  - `save_dir`: checkpoint directory (e.g., `path_to_checkpoints`)
- `system_params`
  - device usage and GPU selection (note that when more than one GPU is enabled, DDP is triggered)
- `log_params`
  - project/run names for logging (default: Weights & Biases)
- `lr_finder`
  - `grid_search_params`:
    - `min_pow`, `max_pow`: LR search range (`10^min_pow` to `10^max_pow`)
    - `resolution`: number of LR candidates
    - `n_epochs`: maximum number of epochs for the search
    - `random_lr`: sample random LRs in the range
    - `keep_schedule`: keep the LR scheduler during the search
    - `report_intermediate_steps`: validate/log during the search
- `transfer_learning_params`
  - `use_pretrained`: enable a pretrained backbone
  - `pretrained_model_name`: name of the pretrained model
  - `pretrained_path`: directory with the pretrained weights (typically the same as `save_dir`)
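One convenient way to experiment is to copy a shipped config and adjust only the parameters above. This is a sketch: the copied filename and all values are placeholders, and the key paths in the comments follow the arrow notation used in this README rather than the exact JSON nesting:

```bash
# Copy the default classification config and edit a few of the documented keys.
cp config_files/Classification.json config_files/MyExperiment.json

# Typical edits (made inside the JSON file; values are placeholders):
#   dataset_params -> data_location            : /path_to_data/DKD
#   dataloader_params -> batch size            : 64
#   model_params -> backbone_type              : resnet50
#   training_params -> model_name              : my_dkd_classifier
#   training_params -> save_dir                : path_to_checkpoints
#   transfer_learning_params -> use_pretrained : false

# Then point the script at the edited config.
python kidneyai_classification.py --params_path config_files/MyExperiment.json --kfold
```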
