This project automates the acquisition, metadata extraction, and consolidation of publicly available 3D electron microscopy datasets, with the future goal of enabling efficient, block-wise access for AI/ML pipelines.
- Automated Data Download: Robust scripts to download diverse 3D electron microscopy datasets from multiple sources.
- Metadata Extraction & Consolidation: Identify, extract, and harmonize relevant metadata (e.g., attrs, chunks) from each dataset, providing a unified and queryable view (see METADATA_SUMMARY.md).
- AI/ML Pipeline Data Access Design: Outline and prototype a strategy for block-wise access to large 3D image datasets for scalable AI/ML workflows (see DATA_ACCESS_DESIGN.md).
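
To make the block-wise access idea concrete, here is a minimal sketch of how a large volume could be tiled into fixed-size blocks. It is not the design from DATA_ACCESS_DESIGN.md; the function name `iter_block_slices` and the example shapes are assumptions chosen for illustration, and only the Python standard library is used.

```python
from itertools import product


def iter_block_slices(shape, block_shape):
    """Yield tuples of slices that tile `shape` with blocks of `block_shape`.

    Edge blocks are clipped so they never extend past the volume.
    (Hypothetical helper, not part of the project's code.)
    """
    if len(shape) != len(block_shape):
        raise ValueError("shape and block_shape must have the same number of axes")
    if any(b <= 0 for b in block_shape):
        raise ValueError("block sizes must be positive")
    # Start offsets along each axis, e.g. 0, 64, 128, ...
    starts_per_axis = [range(0, dim, blk) for dim, blk in zip(shape, block_shape)]
    for starts in product(*starts_per_axis):
        yield tuple(
            slice(start, min(start + blk, dim))
            for start, blk, dim in zip(starts, block_shape, shape)
        )


# Example: a 100x64x64 volume split into 64x64x64 blocks gives two blocks,
# the second one clipped to 36 planes along the first axis.
blocks = list(iter_block_slices((100, 64, 64), (64, 64, 64)))
```

Each yielded tuple can index any array-like object that supports slicing (for example a zarr array), so only one block needs to be in memory at a time.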
The following publicly available 3D electron microscopy datasets are targeted:
- EMPIAR-11759: https://www.ebi.ac.uk/empiar/EMPIAR-11759/
- EPFL-Hippocampus: https://www.epfl.ch/labs/cvlab/data/data-em/
- Hemibrain-NG: https://tinyurl.com/hemibrain-ng (Note: only a random 1000x1000x1000 voxel crop is downloaded for this dataset.)
- JRC-MUS-NACC: https://openorganelle.janelia.org/datasets/jrc_mus-nacc-2
- U2OS-Chromatin: https://idr.openmicroscopy.org/webclient/img_detail/9846137/?dataset=10740
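
The random crop used for Hemibrain-NG could be selected along the following lines. This is a sketch under stated assumptions, not the project's actual code: the function name `random_crop_bounds` is hypothetical, and the volume shape in the example is a placeholder rather than the real extent of the dataset.

```python
import random


def random_crop_bounds(volume_shape, crop_shape=(1000, 1000, 1000), seed=None):
    """Return ((start, stop), ...) per axis for a random crop inside the volume.

    Hypothetical helper: a fixed seed makes the crop reproducible.
    """
    if len(volume_shape) != len(crop_shape):
        raise ValueError("volume_shape and crop_shape must have the same number of axes")
    if any(c > v for c, v in zip(crop_shape, volume_shape)):
        raise ValueError("crop does not fit inside the volume")
    rng = random.Random(seed)
    # randint is inclusive on both ends, so the last valid start is v - c.
    starts = [rng.randint(0, v - c) for v, c in zip(volume_shape, crop_shape)]
    return tuple((s, s + c) for s, c in zip(starts, crop_shape))


# Placeholder shape, NOT the real Hemibrain extent.
bounds = random_crop_bounds((5000, 4000, 3000), seed=0)
```

The resulting `(start, stop)` pairs can then be turned into slices and passed to whichever reader is used for the Neuroglancer volume.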
- Python 3.12
- DVC (optional, for data versioning)
- tifffile for handling TIFF files
- cloud-volume for Neuroglancer data
- zarr for scalable array storage
- pyDM3reader for DM3 files
- requests and ftplib (part of the Python standard library) for downloads
- pandas for summary tables
- See requirements.txt for the full list

Clone the repository and install the dependencies in a virtual environment:

    git clone https://github.com/helensilva14/3d-electron-data-project.git
    cd 3d-electron-data-project
    python3 -m venv env
    source env/bin/activate
    pip install -r requirements.txt

Run the main pipeline (downloads data, extracts metadata, consolidates):

    python3 src/main.py

Outputs will be saved in the outputs/ and docs/ directories.
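
As an illustration of the consolidation step, the sketch below maps per-dataset metadata onto a shared set of columns and writes a CSV summary. The column names, the `harmonize` and `to_csv` helpers, and the example records are all assumptions made for this sketch; the fields actually produced by the pipeline may differ.

```python
import csv
import io

# Hypothetical unified columns; the real summary may use different fields.
COLUMNS = ["dataset", "format", "shape", "dtype", "chunks"]


def harmonize(name, raw):
    """Map one dataset's raw metadata dict onto the unified columns."""
    return {
        "dataset": name,
        "format": raw.get("format", "unknown"),
        "shape": "x".join(str(n) for n in raw.get("shape", ())),
        "dtype": str(raw.get("dtype", "")),
        # Sources that report no chunking leave this column blank.
        "chunks": "x".join(str(n) for n in raw.get("chunks") or ()),
    }


def to_csv(rows):
    """Serialize harmonized rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up example records, not real values from the datasets.
raw_records = {
    "example-zarr": {"format": "zarr", "shape": (128, 256, 256),
                     "dtype": "uint8", "chunks": (64, 64, 64)},
    "example-tiff": {"format": "tiff", "shape": (50, 512, 512),
                     "dtype": "uint16"},
}
rows = [harmonize(name, raw) for name, raw in raw_records.items()]
summary_csv = to_csv(rows)
```

The CSV text can be loaded with pandas for querying, or rendered as a table in a summary document.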
This project is licensed under the Apache-2.0 License.