Graphs_integration #161

Open: wants to merge 285 commits into base: main
Conversation

@EnricoTrizio EnricoTrizio commented Nov 13, 2024

General description

Add the code for CVs based on GNNs in the most organic way possible.
This largely inherits from Jintu's work (kudos! @jintuzhang), where all the code was organized as a "library within the library".
Some functions were quite different from the rest of the code (e.g., all the code for GNN models), while others were mostly redundant (e.g., GraphDataset, GraphDataModule, and the base and specific CV classes).

It seemed wise to reduce code duplication and redundancy and to make the whole library more organic, while still including all the new functionality.

The modifications have been made to keep the user experience as close as possible to what it was when only descriptor-based models were implemented. For example, the defaults are based on that scenario, when applicable, and the use of GNN models needs to be explicitly "activated" by the user.

SPOILER: This required some thinking and some modifications here and there

General questions

  • Shall we make the overall structure smoother? I.e., avoid too many utils.py files here and there and too many submodules?
  • Shall we keep the current names for the graph keys in the datasets? I.e., data_list, z_table, etc.
  • Do we like the metadata thing for the datasets?
  • What shall we do with the BLOCKS? Is it worth it to keep this thing?

General todos

  • Check everything 😄
  • Double check the Docs

Point-by-point description

Data handling

Overview

So far, we have a DictDataset (based on torch.utils.data.Dataset) and the corresponding DictModule (based on lightning.LightningDataModule).

For GNNs, there was a GraphDataset (based on lists) and the corresponding GraphDataModule (based on lightning.LightningDataModule).
Here, the data are handled for convenience using the PyTorch Geometric framework.
There are also a bunch of auxiliary functions for neighborhoods and for handling atom types, plus some utilities to initialize the dataset easily from files.

Implemented solution

The two are merged:

  1. A single DictDataset that can handle both types of data.
  • It also has a metadata attribute that stores general properties in a dict (e.g., cutoff and atom_types).
  • In the __init__, the user can specify the data_type (either descriptors, the default, or graphs). This is then stored in metadata and used in the DictModule to handle the data the right way (see below).
  • New utils have been added in mlcolvar.data.utils: save_dataset, load_dataset, and save_dataset_configurations_as_extyz.
  2. A single DictModule that can handle both types of data. Depending on the metadata['data_type'] of the incoming dataset, it uses either our DictLoader or the torch_geometric DataLoader.
  3. A new submodule data.graph containing:
  • atomic.py for handling atomic quantities based on the data class Configuration;
  • neighborhood.py for building neighbor lists using matscipy;
  • utils.py to frame Configurations into datasets and one-hot embeddings. It also contains create_test_graph_input, as creating inputs for testing here requires several lines of code.
  4. A new create_dataset_from_trajectories util in mlcolvar.utils.io that allows creating a dataset directly from trajectory files, given topology files, using mdtraj, thus allowing easy handling of the more complex bio-simulation formats. For solids/surfaces/chemistry, the util handles .xyz files using a combination of ase and mdtraj to be efficient while retaining the convenient mdtraj atom selection.
  5. A single create_timelagged_dataset that can also create a time-lagged dataset starting from a DictDataset with data_type=='graphs'.
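As an aside, the neighbor-list construction that neighborhood.py delegates to matscipy boils down to finding all atom pairs within the cutoff. The brute-force sketch below is purely illustrative (the function name and arguments are not the library API); the real implementation is far more efficient:

```python
from itertools import combinations

def brute_force_neighbor_list(positions, cutoff):
    """Illustrative O(N^2) neighbor list: return (src, dst) index pairs for
    atoms closer than `cutoff`, in both directions, as in a directed graph."""
    edges = []
    for i, j in combinations(range(len(positions)), 2):
        dist2 = sum((a - b) ** 2 for a, b in zip(positions[i], positions[j]))
        if dist2 < cutoff ** 2:
            edges.append((i, j))
            edges.append((j, i))
    return edges

# Three atoms on a line, 1.0 apart: with cutoff 1.5 only adjacent pairs connect.
pos = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(brute_force_neighbor_list(pos, 1.5))
```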

NB: For graph datasets, the keys are the original ones:

  • data_list: all the graph data, e.g., edge src and dst, batch index... (this goes in DictDataset)
  • z_table: atomic numbers map (this goes in DictDataset.metadata)
  • cutoff: cutoff used in the graph (this goes in DictDataset.metadata)
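The data_type dispatch described above can be pictured as follows; this is a schematic stand-in for the actual DictModule logic, with loader names returned as plain strings for illustration:

```python
def pick_loader(dataset_metadata):
    """Schematic version of the DictModule dispatch: descriptor datasets go
    through the in-house DictLoader, graph datasets through the PyTorch
    Geometric DataLoader. Illustrative only, not the mlcolvar code."""
    data_type = dataset_metadata.get("data_type", "descriptors")
    if data_type == "graphs":
        return "torch_geometric.loader.DataLoader"
    return "mlcolvar.data.DictLoader"

print(pick_loader({"data_type": "graphs", "cutoff": 5.0}))  # graph branch
print(pick_loader({}))                                      # descriptor default
```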

GNN models

Overview

Of course, they needed to be implemented 😄 but we could inherit most of the code from Jintu.
As an overview, there is a BaseGNN parent class that implements the common features, and then each model (e.g., SchNet or GVP) is implemented on top of that.
There is also a radial.py that implements a bunch of tools for radial embeddings.

Implemented solution

The GNN code is now implemented in mlcolvar.core.nn.graph.

  1. There is a BaseGNN class that serves as a template for the architecture-specific code. For example, it already has the methods for embedding edges and for setting some common properties.
  2. The Radial module implements the tools for radial embeddings.
  3. The SchNetModel and GVPModel are implemented on top of BaseGNN.
  4. In utils.py, there is a function that creates data for this module's tests. It could be replaced by the very similar function mlcolvar.data.graph.utils.create_test_graph_input, which is more general and also used elsewhere.
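As an illustration of what a radial embedding does, here is a minimal Gaussian-basis expansion of an edge length, a common choice in SchNet-style models; the actual basis functions and parameters in radial.py may differ:

```python
import math

def gaussian_rbf(distance, cutoff, n_basis=8):
    """Expand a scalar distance into n_basis Gaussian features whose centers
    are spread uniformly on [0, cutoff]. Illustrative sketch only."""
    centers = [k * cutoff / (n_basis - 1) for k in range(n_basis)]
    width = cutoff / (n_basis - 1)
    return [math.exp(-((distance - c) ** 2) / (2 * width ** 2)) for c in centers]

feats = gaussian_rbf(2.5, cutoff=5.0)
print(len(feats))  # 8 features per edge
```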

CV models

Overview

In Jintu's implementation, all the CV classes we tested were re-implemented, still using the original loss-function code.
The point is that the initialization of the underlying ML model (also in the current version of the library) is performed within the CV class.
We did it this way to keep things simple, and indeed it is simple for feed-forward networks, as they have very few things to set (i.e., layers, nodes, activations), and also because there were no alternatives at the time.
For GNNs, however, the initialization can vary a lot (i.e., different architectures and many parameters one could set).

We couldn't cut corners here and still include everything, so somewhere we need to add an extra layer of complexity, either to the workflow or to the CV models.

Implemented solution

We keep everything similar to what it used to be in the library, except for:

  1. We rename the layers keyword to the more general model in the init of the CV classes, which can accept:
  • A list of integers, as before. This works like the old layers keyword and initializes a FeedForward with it and all the DEFAULT_BLOCKS (see point 2), e.g., for DeepLDA: ['norm_in', 'nn', 'lda'].
  • A mlcolvar.core.nn.FeedForward or mlcolvar.core.nn.graph.BaseGNN model initialized outside the CV class. This overrides the old default: one provides an external model, and the MODEL_BLOCKS are used, e.g., for DeepLDA: ['nn', 'lda']. For example, the initialization can look like this:
```python
# for GNN-based CVs
gnn_model = SchNet(...)
model = DeepLDA(..., model=gnn_model, ...)

# for FFNN-based CVs, alternative 1: keeps the normalization from DEFAULT_BLOCKS
model = DeepLDA(..., model=[2, 3], ...)

# for FFNN-based CVs, alternative 2: uses the MODEL_BLOCKS
ff_model = FeedForward(layers=[2, 3])
model = DeepLDA(..., model=ff_model, ...)
```
  2. The BLOCKS of each CV model are duplicated into DEFAULT_BLOCKS and MODEL_BLOCKS to account for the different behaviors. This was a simple way to initialize everything in all cases (maybe not the best one, see questions).
  3. In the training step, the change amounts to a different setup of the data depending on the type of ML model we are using; the rest is basically the same as before.
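The branching on the model argument can be summarized schematically; the class FakeFeedForward and the function resolve_model below are placeholders for illustration, not the actual mlcolvar implementation:

```python
class FakeFeedForward:
    """Stand-in for mlcolvar.core.nn.FeedForward, for illustration only."""
    def __init__(self, layers):
        self.layers = layers

DEFAULT_BLOCKS = ["norm_in", "nn", "lda"]  # used when `model` is a list of layer sizes
MODEL_BLOCKS = ["nn", "lda"]               # used when an external model is passed

def resolve_model(model):
    """Schematic: a list of integers builds a feed-forward network with the
    default blocks; an already-built model is used as-is with the reduced
    block list (no internal normalization)."""
    if isinstance(model, list):
        return FakeFeedForward(model), DEFAULT_BLOCKS
    return model, MODEL_BLOCKS

nn1, blocks1 = resolve_model([2, 3])                   # old-style list input
nn2, blocks2 = resolve_model(FakeFeedForward([2, 3]))  # external model input
print(blocks1, blocks2)
```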

Things to note

  1. All the loss functions are untouched! Except for the CommittorLoss, as it depends not only on the output space but also on the derivatives w.r.t. the input positions.
  2. When an external GNN model is provided, logging still does not work. I left this for the very end of the PR, focusing first on making things work.
  3. Autoencoder-based CVs only raise a NotImplementedError, as we do not have a stable GNN-based autoencoder for now. As a consequence, MultiTaskCV also does not support GNN models, since, in the way we intend it, it wouldn't make much sense without a GNN-based AE.

TODOs

  • Make logger work with graph models 🗡️
  • Add autoencoders (in the future)

Explain module

Overview

There is a new module graph_sensitivity that performs a per-node sensitivity analysis. Some internal functions have been adapted to handle both dataset types.
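A per-node sensitivity analysis can be pictured as perturbing one node at a time and recording how much the model output changes. This toy sketch uses a plain scalar function in place of a trained CV and is not the graph_sensitivity implementation:

```python
def node_sensitivity(model_fn, node_features, eps=1e-3):
    """For each node, perturb its (scalar) feature by eps and measure the
    absolute change in the model output, normalized by eps. Toy example:
    a finite-difference stand-in for a gradient-based analysis."""
    base = model_fn(node_features)
    sens = []
    for i in range(len(node_features)):
        perturbed = list(node_features)
        perturbed[i] += eps
        sens.append(abs(model_fn(perturbed) - base) / eps)
    return sens

# Toy CV: depends strongly on node 0, weakly on node 1, not at all on node 2.
def toy_cv(x):
    return 10.0 * x[0] + 0.1 * x[1]

print(node_sensitivity(toy_cv, [1.0, 1.0, 1.0]))
```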

TODOs

  • Maybe we can add something to visualize the results on the molecule?

Status

  • Ready to go
