UMIE_datasets

🤩 About the Project

Warning: This project is currently in the alpha stage and may be subject to major changes.

This repository provides a suite of unified scripts to standardize, preprocess, and integrate 882,774 images from 20 open-source medical imaging datasets, spanning modalities such as X-ray, CT, and MR. The scripts allow for seamless and fast download of a diverse medical dataset. We create a unified set of annotations so that the datasets can be merged without mislabelling. Each dataset is preprocessed with a custom sklearn pipeline whose steps are reusable across datasets. The code is designed so that preprocessing a new dataset only requires reusing the available pipeline steps and customizing them by setting the appropriate pipeline params.
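
The idea of reusable, parameterised steps can be sketched with scikit-learn's `Pipeline`. This is only an illustration: the step classes `FilterByExtension` and `AddPrefix`, their parameters, and the file names are hypothetical and are not the repository's actual steps. The sketch operates on path strings rather than on image files.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class FilterByExtension(BaseEstimator, TransformerMixin):
    """Hypothetical step: keep only paths that end with a given extension."""

    def __init__(self, extension=".dcm"):
        self.extension = extension

    def fit(self, X, y=None):
        return self  # stateless step, nothing to learn

    def transform(self, X):
        return [path for path in X if path.endswith(self.extension)]


class AddPrefix(BaseEstimator, TransformerMixin):
    """Hypothetical step: prefix each file name with a dataset uid."""

    def __init__(self, dataset_uid="00"):
        self.dataset_uid = dataset_uid

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [f"{self.dataset_uid}_{path}" for path in X]


def build_pipeline(extension, dataset_uid):
    """Assemble the same two steps with dataset-specific params."""
    return Pipeline([
        ("filter", FilterByExtension(extension=extension)),
        ("prefix", AddPrefix(dataset_uid=dataset_uid)),
    ])


files = ["scan1.nii.gz", "notes.txt", "scan2.nii.gz"]
renamed = build_pipeline(extension=".nii.gz", dataset_uid="00").fit_transform(files)
print(renamed)
```

A second dataset would reuse the same classes and change only the arguments passed to `build_pipeline`.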

The labels and segmentation masks were unified to comply with the RadLex ontology.

Datasets

| uid | Dataset | Modality | Task |
| --- | --- | --- | --- |
| 0 | KITS-23 | CT | Classification, Segmentation |
| 1 | CoronaHack | XRAY | Classification |
| 2 | Alzheimers Dataset | MRI | Classification |
| 3 | Brain Tumor Classification | MRI | Classification |
| 4 | COVID-19 Detection X-Ray | XRAY | Classification |
| 5 | Finding and Measuring Lungs in CT Data | CT | Segmentation |
| 6 | Brain CT Images with Intracranial Hemorrhage Masks | CT | Classification |
| 7 | Liver and Liver Tumor Segmentation | CT | Classification, Segmentation |
| 8 | Brain MRI Images for Brain Tumor Detection | MRI | Classification |
| 9 | Knee Osteoarthritis Dataset with Severity Grading | XRAY | Classification |
| 10 | Brain Tumor Progression | MRI | Segmentation |
| 11 | Chest X-ray 14 | XRAY | Classification |
| 12 | COCA- Coronary Calcium and chest CTs | CT | Segmentation |
| 13 | BrainMetShare | MRI | Segmentation |
| 14 | CT-ORG | CT | Segmentation |
| 17 | LIDC-IDRI | CT | Segmentation |
| 18 | CMMD | MG | Classification |

Using the datasets

Installing requirements

poetry install

Creating the dataset

Due to the copyright restrictions of the source datasets, we can't share the files directly. To obtain the full dataset you have to download the source datasets yourself and run the preprocessing scripts.

0. KITS-23

Clone the KITS-23 repository.
Enter the KITS-23 directory and install the packages with pip.
```
cd kits23
pip3 install -e .
```
Run the following command to download the data to the dataset/ folder.
```
kits23_download_data
```

Fill in the source_path, target_path, and labels_path of KITS23Pipeline() in config/runner_config.py, e.g.:

    KITS23Pipeline(
        path_args={
            "source_path": "kits23/dataset",  # Path to the dataset directory in KITS23 repo
            "target_path": TARGET_PATH,
            "labels_path": "kits23/dataset/kits23.json",  # Path to kits23.json
        },
        dataset_args=dataset_config.KITS23
    ),
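
Before running a pipeline, it can help to confirm that the configured input paths exist. The helper below is a minimal sketch and not part of the repository; `missing_paths` is a hypothetical name, and the `"./data/"` value stands in for whatever `TARGET_PATH` is set to.

```python
from pathlib import Path


def missing_paths(path_args, keys=("source_path", "labels_path")):
    """Return the keys whose configured path does not exist on disk."""
    return [
        key for key in keys
        if key in path_args and not Path(path_args[key]).exists()
    ]


# Same shape as the path_args dictionary in the example above.
path_args = {
    "source_path": "kits23/dataset",
    "target_path": "./data/",  # placeholder for TARGET_PATH
    "labels_path": "kits23/dataset/kits23.json",
}

for key in missing_paths(path_args):
    print(f"{key} does not exist: {path_args[key]}")
```

Only the input paths are checked here, on the assumption that the target directory may legitimately not exist yet.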

1. Xray CoronaHack -Chest X-Ray-Dataset

Go to the CoronaHack page on Kaggle.
Log in to your Kaggle account.
Download the data.
Extract archive.zip.
Set source_path in CoronaHackPipeline() in config/runner_config.py to the location of the archive folder.
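
The extraction step, which recurs for every Kaggle dataset below, can also be scripted. This is a minimal sketch using only the standard library; `extract_archive` is a hypothetical helper, not something the repository provides.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Extract a downloaded zip file and return the extraction directory."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

For example, `extract_archive("archive.zip", "archive")` would unpack the download next to it, and the returned directory is the value to use for source_path.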

2. Alzheimer's Dataset (4 class of Images)

Go to the Alzheimer's Dataset page on Kaggle.
Log in to your Kaggle account.
Download the data.
Extract archive.zip.
Set source_path in AlzheimersPipeline() in config/runner_config.py to the location of the archive folder.

3. Brain Tumor Classification (MRI)

Go to the Brain Tumor Classification page on Kaggle.
Log in to your Kaggle account.
Download the data.
Extract archive.zip.
Set source_path in BrainTumorClassificationPipeline() in config/runner_config.py to the location of the archive folder.

4. COVID-19 Detection X-Ray

Go to the COVID-19 Detection X-Ray page on Kaggle.
Log in to your Kaggle account.
Download the data.
Extract archive.zip.
Remove the TrainData folder. We do not want augmented data at this stage.
Set source_path in COVID19DetectionPipeline() in config/runner_config.py to the location of the archive folder.
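
Removing the TrainData folder can be scripted as follows. This is a sketch under one assumption: since the exact location of TrainData inside the archive is not stated here, the hypothetical helper `remove_augmented_data` searches the whole extracted tree for folders with that name.

```python
import shutil
from pathlib import Path


def remove_augmented_data(archive_dir, folder_name="TrainData"):
    """Delete every directory named folder_name below archive_dir.

    Returns the list of removed paths.
    """
    # Collect matches first so the tree is not modified while it is walked.
    matches = [p for p in Path(archive_dir).rglob(folder_name) if p.is_dir()]
    removed = []
    for folder in matches:
        if folder.exists():  # may already be gone if nested in a removed match
            shutil.rmtree(folder)
            removed.append(folder)
    return removed
```

Calling `remove_augmented_data("archive")` deletes data irreversibly, so it is worth checking the returned paths on a copy first.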

5. Finding and Measuring Lungs in CT Data

Go to the Finding and Measuring Lungs in CT Data page on Kaggle.
Log in to your Kaggle account.
Download the data.
Extract archive.zip.
Set source_path in FindingAndMeasuringLungsPipeline() in config/runner_config.py to the location of the archive/2d_images folder. Set masks_path to the location of the archive/2d_masks folder.
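
When images and masks live in separate folders, a quick consistency check can catch an incomplete download. The sketch below assumes, without confirmation from this README, that each image and its mask share the same file name; `unmatched_images` is a hypothetical helper.

```python
from pathlib import Path


def unmatched_images(source_path, masks_path):
    """Return the names of image files with no same-named file in masks_path."""
    mask_names = {p.name for p in Path(masks_path).iterdir() if p.is_file()}
    return sorted(
        p.name
        for p in Path(source_path).iterdir()
        if p.is_file() and p.name not in mask_names
    )
```

With the paths from the step above, `unmatched_images("archive/2d_images", "archive/2d_masks")` returning an empty list would indicate that every image has a mask under that naming assumption.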

6. Brain CT Images with Intracranial Hemorrhage Masks

Go to the Brain With Intracranial Hemorrhage page on Kaggle.
Log in to your Kaggle account.
Download the data.
Extract archive.zip.
Set source_path in BrainWithIntracranialHemorrhagePipeline() in config/runner_config.py to the location of the archive folder. Set masks_path to the same path as source_path.

7. Liver and Liver Tumor Segmentation (LITS)

Go to the Liver and Liver Tumor Segmentation page on Kaggle.
Log in to your Kaggle account.
Download the data.
Extract archive.zip.
Set source_path in this dataset's pipeline entry in config/runner_config.py to the location of the archive folder. Set masks_path as well.

8. Brain MRI Images for Brain Tumor Detection

Go to the Brain MRI Images for Brain Tumor Detection page on Kaggle.
Log in to your Kaggle account.
Download the data.
Extract archive.zip.
Set source_path in BrainTumorDetectionPipeline() in config/runner_config.py to the location of the archive folder.

9. Knee Osteoarthritis Dataset with Severity Grading

Go to the Knee Osteoarthritis Dataset with Severity Grading page on Kaggle.
Log in to your Kaggle account.
Download the data.
Extract archive.zip.
Set source_path in this dataset's pipeline entry in config/runner_config.py to the location of the archive folder.

10. Brain-Tumor-Progression

Go to the Brain Tumor Progression dataset page on The Cancer Imaging Archive.

11. Chest X-ray 14

Go to Chest X-ray 14.
Create an account.
Download the images folder and DataEntry2017_v2020.csv.

12. COCA- Coronary Calcium and chest CTs

Go to COCA- Coronary Calcium and chest CTs.
Log in or sign up for a Stanford AIMI account.
Fill in your contact details.
Download the data with azcopy.
Fill in the source_path with the location of the cocacoronarycalciumandchestcts-2/Gated_release_final/patient folder. Fill in masks_path with cocacoronarycalciumandchestcts-2/Gated_release_final/calcium_xml xml file.

13. BrainMetShare

Go to BrainMetShare.
Log in or sign up for a Stanford AIMI account.
Fill in your contact details.
Download the data with azcopy.

14. CT-ORG

Go to the CT-ORG page on The Cancer Imaging Archive.
Download the data.
Extract PKG - CT-ORG.
Set source_path in CtOrgPipeline() in config/runner_config.py to the location of the OrganSegmentations folder. Set masks_path to the same path as source_path.

17. LIDC-IDRI

Go to LIDC-IDRI.
Download "Images" using NBIA Data Retriever, and "Radiologist Annotations/Segmentations".
Extract LIDC-XML-only.zip.
Set source_path in the LIDC-IDRI pipeline entry in config/runner_config.py to the location of the manifest-{xxxxxxxxxxxxx}/LIDC-IDRI directory.
Set masks_path in the same pipeline entry to the location of the LIDC-XML-only/ directory.
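
The manifest-{xxxxxxxxxxxxx} directory name differs between downloads, so it can be convenient to look it up instead of typing it. A minimal sketch, assuming the download directory contains exactly one matching manifest folder; `find_manifest_subdir` is a hypothetical helper.

```python
from pathlib import Path


def find_manifest_subdir(download_dir, subdir):
    """Return the single manifest-*/<subdir> directory below download_dir."""
    candidates = sorted(
        p for p in Path(download_dir).glob(f"manifest-*/{subdir}") if p.is_dir()
    )
    if len(candidates) != 1:
        raise FileNotFoundError(
            f"expected exactly one manifest-*/{subdir} in {download_dir}, "
            f"found {len(candidates)}"
        )
    return candidates[0]
```

For instance, `find_manifest_subdir("downloads", "LIDC-IDRI")` would return the path to use as source_path, where `"downloads"` is a placeholder for wherever the retriever saved the data.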

18. CMMD - The Chinese Mammography Database

Go to CMMD.
Download the .tcia file from the Data Access table.
Download NBIA Data Retriever to be able to download the images.
Download CMMD_clinicaldata_revision.xlsx from the Data Access table for the label information.
Set source_path in CmmdPipeline() in config/runner_config.py to the location of the manifest-{xxxxxxxxxxxxx}/CMMD folder.
Set labels_path in CmmdPipeline() in config/runner_config.py to the location of the CMMD_clinicaldata_revision.xlsx file.

To preprocess a dataset that is not listed above, look in the preprocessing folder. It contains the reusable steps for changing imaging formats, extracting masks, creating file trees, etc. Check the config file to see which mask and label encodings are available, and append new label and mask encodings if needed.

Overall, the dataset should contain 882,774 images in .png format:

CT - 500k+
X-Ray - 250k+
MRI - 100k+
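
After all pipelines have run, the total can be compared against the figure above. This is a minimal sketch; `count_png_images` is a hypothetical helper, and it assumes every output image sits somewhere below a single target directory.

```python
from pathlib import Path

EXPECTED_TOTAL = 882_774  # figure quoted in this README


def count_png_images(target_path):
    """Count .png files at any depth below target_path."""
    return sum(1 for p in Path(target_path).rglob("*.png") if p.is_file())
```

Comparing `count_png_images(target_path)` with `EXPECTED_TOTAL` gives a quick signal of whether any dataset was skipped or only partially processed.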

🎯 Roadmap

👋 Contributors

🤝 Contact

Barbara Klaudel

TheLion.AI

Development

Pre-commits

Install pre-commit: https://pre-commit.com/#installation

If you are using VS Code, install the extension: https://marketplace.visualstudio.com/items?itemName=MarkLarah.pre-commit-vscode

To do a dry run of the pre-commit hooks and check whether your code passes, run:

pre-commit run --all-files

Adding python packages

Dependencies are handled by Poetry. To add a new dependency, run:

poetry add <package_name>

Debugging

To modify and debug the app, developing in containers can be useful.

Testing

run_tests.sh
