Data and Notebooks for: "Maximizing Efficiency of Dataset Compression for Machine Learning Potentials With Information Theory"

This repository contains the data and notebooks needed to reproduce the paper:

Benjamin Yu, Vincenzo Lordi, Daniel Schwalbe-Koda. "Maximizing Efficiency of Dataset Compression for Machine Learning Potentials With Information Theory" (2025).

  • All the raw data used for plotting in the notebooks can be downloaded with the download.sh script.
  • The Jupyter notebooks in nbs contain all the code required to reproduce the analysis and the plots shown in the manuscript.

The algorithms are implemented under the QUESTS package.
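For orientation, below is a minimal sketch of computing dataset entropy and delta entropy with QUESTS. The function names follow the QUESTS documentation, but the file paths and parameter values (k, cutoff, bandwidth h) are illustrative placeholders, not the exact settings used in the paper.

from ase.io import read
from quests.descriptor import get_descriptors
from quests.entropy import perfect_entropy, delta_entropy

# Load the full and compressed datasets (extended-XYZ; paths are illustrative).
full = read("full_dataset.xyz", index=":")
subset = read("compressed_dataset.xyz", index=":")

# Build QUESTS descriptors (k nearest neighbors, cutoff radius in Angstrom).
x_full = get_descriptors(full, k=32, cutoff=5.0)
x_sub = get_descriptors(subset, k=32, cutoff=5.0)

# Entropy of the compressed set, and delta entropy of the full dataset
# with respect to the compressed set (h is the kernel bandwidth).
H = perfect_entropy(x_sub, h=0.015)
dH = delta_entropy(x_full, x_sub, h=0.015)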

Installing and running

To reproduce the results from the manuscript, first create a new Python environment using your preferred virtual environment manager (e.g., venv or conda). Then, clone this repository and install it with

git clone git@github.com:digital-synthesis-lab/2025-compression-data.git
cd 2025-compression-data
pip install -e .

This should install all dependencies (see pyproject.toml) to reproduce the data in the manuscript.

To download the raw data containing all the results for this paper (and the data required for the analysis), run

chmod +x download.sh
./download.sh

in the root of the repository. While some of the data is already included in the repository, most of the raw data is too large for GitHub and is instead hosted on Zenodo for persistent storage (DOI: 10.5281/zenodo.17536234).

Data and Code Description

After downloading and extracting the raw data, the data/ folder will contain all results from the paper. The tarfile is organized by dataset, with the following structure:

data/
├── GAP20
│   ├── Fullerenes
│   │   ├── dH
│   │   ├── SevenNet
│   │   ├── test
│   │   ├── train
│   │   └── val
│   ├── Nanotubes
│   │   └── ...
│   └── Graphene
│       └── ...
└── TM23
    ├── Ag
    │   ├── data_cold
    │   ├── data_warm
    │   ├── dH_cold
    │   ├── dH_warm
    │   ├── SevenNet
    │   └── test
    ├── Au
    │   └── ...
    ├── Cd
    │   └── ...
    ├── Co
    │   └── ...
    ├── Ir
    │   └── ...
    ├── Pd
    │   └── ...
    └── Ti
        └── ...

data_csv/

The tarfile contains files of the following formats:

  • Training, testing, and validation XYZ files generated for model training.
  • Training logs for every model trained.
  • Delta entropy calculations for each compressed dataset with respect to the full dataset.
  • CSV files containing the post-processed results from the analysis, which are used for plotting all figures in the manuscript.
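As a quick start for the analysis, the sketch below loads one of the XYZ files and a results CSV with standard Python tooling (ase and pandas), assuming both are available in your environment; the file names are hypothetical examples of the tarfile contents, not guaranteed paths.

from ase.io import read
import pandas as pd

# Read every structure from a training set (path is a hypothetical example).
frames = read("data/GAP20/Fullerenes/train/train.xyz", index=":")
print(f"{len(frames)} structures; first frame has {len(frames[0])} atoms")

# Load post-processed results used to plot the manuscript figures.
df = pd.read_csv("data_csv/results.csv")  # hypothetical file name
print(df.head())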

Citing

If you use the algorithms/benchmarks for compressing datasets in this work, please cite the following papers:

@article{yu2025compression,
    title = {Maximizing Efficiency of Dataset Compression for Machine Learning Potentials With Information Theory},
    author = {Yu, Benjamin and Lordi, Vincenzo and Schwalbe-Koda, Daniel},
    year = {2025},
    journal = {arXiv:2511.10561},
    url = {https://doi.org/10.48550/arXiv.2511.10561},
    doi = {10.48550/arXiv.2511.10561},
}

@article{schwalbekoda2025information,
    title = {Model-free estimation of completeness, uncertainties, and outliers in atomistic machine learning using information theory},
    author = {Schwalbe-Koda, Daniel and Hamel, Sebastien and Sadigh, Babak and Zhou, Fei and Lordi, Vincenzo},
    year = {2025},
    journal = {Nature Communications},
    volume = {16},
    pages = {4014},
    url = {https://doi.org/10.1038/s41467-025-59232-0},
    doi = {10.1038/s41467-025-59232-0},
}

The code used to analyze and compress the dataset is available under the QUESTS package.

License

This repository is distributed under the MIT License (SPDX: MIT).

Acknowledgements

This work was supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences under Award Number DE-SC0025642.
