Data and Notebooks for: "Maximizing Efficiency of Dataset Compression for Machine Learning Potentials With Information Theory"

This repository contains the data and notebooks needed to reproduce the paper:

Benjamin Yu, Vincenzo Lordi, Daniel Schwalbe-Koda. "Maximizing Efficiency of Dataset Compression for Machine Learning Potentials With Information Theory" (2025).

  • All the raw data used for plotting in the notebooks can be downloaded with the download.sh script.
  • The Jupyter notebooks in nbs contain all the code required to reproduce the analysis and the plots shown in the manuscript.

The algorithms are implemented under the QUESTS package.
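For orientation, below is a minimal sketch of computing dataset entropy and delta entropy with QUESTS. The function names follow the QUESTS documentation, but the file paths and parameter values (k, cutoff, bandwidth h) are illustrative placeholders, not the exact settings used in the paper.

from ase.io import read
from quests.descriptor import get_descriptors
from quests.entropy import perfect_entropy, delta_entropy

# Load the full and compressed datasets (extended-XYZ; paths are illustrative).
full = read("full_dataset.xyz", index=":")
subset = read("compressed_dataset.xyz", index=":")

# Build QUESTS descriptors (k nearest neighbors, cutoff radius in Angstrom).
x_full = get_descriptors(full, k=32, cutoff=5.0)
x_sub = get_descriptors(subset, k=32, cutoff=5.0)

# Entropy of the compressed set, and delta entropy of the full dataset
# with respect to the compressed set (h is the kernel bandwidth).
H = perfect_entropy(x_sub, h=0.015)
dH = delta_entropy(x_full, x_sub, h=0.015)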

Installing and running

To reproduce the results from the manuscript, first create a new Python environment using your preferred virtual environment manager (e.g., venv or conda). Then, clone this repository and install it with

git clone git@github.com:digital-synthesis-lab/2025-compression-data.git
cd 2025-compression-data
pip install -e .

This should install all dependencies (see pyproject.toml) to reproduce the data in the manuscript.

To download the raw data containing all the results for this paper (and the data required for the analysis), run

chmod +x download.sh
./download.sh

in the root of the repository. While some of the data is already included in the repository, most of the raw data is too large for GitHub and is instead hosted on Zenodo for persistent storage (DOI: 10.5281/zenodo.17536234).

Data and Code Description

After downloading and extracting the raw data, the data/ folder will contain all results from the paper. The tarfile is organized by dataset, with the following structure:

data/
├── GAP20
│   ├── Fullerenes
│   │   ├── dH
│   │   ├── SevenNet
│   │   ├── test
│   │   ├── train
│   │   └── val
│   ├── Nanotubes
│   │   └── ...
│   └── Graphene
│       └── ...
└── TM23
    ├── Ag
    │   ├── data_cold
    │   ├── data_warm
    │   ├── dH_cold
    │   ├── dH_warm
    │   ├── SevenNet
    │   └── test
    ├── Au
    │   └── ...
    ├── Cd
    │   └── ...
    ├── Co
    │   └── ...
    ├── Ir
    │   └── ...
    ├── Pd
    │   └── ...
    └── Ti
        └── ...

data_csv/

The tarfile contains files of the following formats:

  • Training, testing, and validation XYZ files generated for model training.
  • Training logs for every model trained.
  • Delta entropy calculations for each compressed dataset with respect to the full dataset.
  • CSV files containing the post-processed results from the analysis, which are used for plotting all figures in the manuscript.
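As a quick start for the analysis, the sketch below loads one of the XYZ files and a results CSV with standard Python tooling (ase and pandas), assuming both are available in your environment; the file names are hypothetical examples of the tarfile contents, not guaranteed paths.

from ase.io import read
import pandas as pd

# Read every structure from a training set (path is a hypothetical example).
frames = read("data/GAP20/Fullerenes/train/train.xyz", index=":")
print(f"{len(frames)} structures; first frame has {len(frames[0])} atoms")

# Load post-processed results used to plot the manuscript figures.
df = pd.read_csv("data_csv/results.csv")  # hypothetical file name
print(df.head())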

Citing

If you use the algorithms/benchmarks for compressing datasets in this work, please cite the following papers:

@article{yu2025compression,
    title = {Maximizing Efficiency of Dataset Compression for Machine Learning Potentials With Information Theory},
    author = {Yu, Benjamin and Lordi, Vincenzo and Schwalbe-Koda, Daniel},
    year = {2025},
    journal = {arXiv:2511.10561},
    url = {https://doi.org/10.48550/arXiv.2511.10561},
    doi = {10.48550/arXiv.2511.10561},
}

@article{schwalbekoda2025information,
    title = {Model-free estimation of completeness, uncertainties, and outliers in atomistic machine learning using information theory},
    author = {Schwalbe-Koda, Daniel and Hamel, Sebastien and Sadigh, Babak and Zhou, Fei and Lordi, Vincenzo},
    year = {2025},
    journal = {Nature Communications},
    volume = {16},
    pages = {4014},
    url = {https://doi.org/10.1038/s41467-025-59232-0},
    doi = {10.1038/s41467-025-59232-0},
}

The code used to analyze and compress the dataset is available under the QUESTS package.

License

This repository is distributed under the MIT License (SPDX: MIT).

Acknowledgements

This work was supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences under Award Number DE-SC0025642.
