This repository contains benchmarking scripts and data for Chemprop v2, a message-passing neural network for molecular property prediction, as described in the paper "Chemprop v2: Modular, Fast, and User-Friendly". Please refer to the Chemprop repository for installation and usage instructions.
All datasets used in the study can be downloaded from Zenodo. You can either download and extract the file data.tar.gz yourself, or run:
wget https://zenodo.org/records/10078142/files/data.tar.gz
tar -xzvf data.tar.gz
The data folder should be placed within the chemprop_benchmark_v2 folder (i.e. where this README and the scripts folder are located).
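As a quick sanity check of the layout described above, a minimal sketch along the following lines could be run from the chemprop_benchmark_v2 folder. The helper name check_layout is hypothetical and not part of this repository; it only tests that the data and scripts folders sit side by side.

```shell
#!/usr/bin/env bash
# Sketch only: check_layout is a hypothetical helper, not shipped with the repository.
# It verifies that the "data" and "scripts" folders exist under the given root.
check_layout() {
    local root="${1:-.}"
    local missing=0
    for dir in data scripts; do
        if [ ! -d "$root/$dir" ]; then
            echo "missing: $root/$dir" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Run from the folder that contains this README.
if check_layout .; then
    echo "layout OK"
else
    echo "layout incomplete"
fi
```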
The paper reports a large number of benchmarks that can be run individually by executing one of the shell scripts in the scripts folder. For example, to run the barriers_e2 reaction benchmark, activate your Chemprop environment as described in the Chemprop repository, and then run (after adapting the path to your Chemprop folder):
cd scripts
./barriers_e2.sh
This will run a hyperparameter search, followed by a training run with the best hyperparameters, and produce the results_barriers_e2 folder with all the necessary information, including model checkpoints and test set predictions.
The following benchmarking systems were used in the paper:
hiv: HIV replication inhibition from MoleculeNet and OGB with scaffold splits
pcba_random: Biological activities from MoleculeNet with random splits
pcba_random_nans: Biological activities from MoleculeNet with missing targets NOT set to zero (to be comparable to the OGB version) with random splits
pcba_scaffold: Biological activities from MoleculeNet and OGB with scaffold splits
qm9_multitask: DFT calculated properties from MoleculeNet and OGB, trained as a multi-task model
qm9_u0: DFT calculated properties from MoleculeNet and OGB, trained as a single-task model on the target U0 only
qm9_gap: DFT calculated properties from MoleculeNet and OGB, trained as a single-task model on the target gap only
sampl: Octanol–water partition coefficients (SAMPL6 & 7) and toluene–water partition coefficients (SAMPL9)
barriers_e2: Reaction barrier heights of E2 reactions
barriers_sn2: Reaction barrier heights of SN2 reactions
barriers_cycloadd: Reaction barrier heights of cycloaddition reactions
barriers_rdb7: Reaction barrier heights in the RDB7 dataset
barriers_rgd1: Reaction barrier heights in the RGD1-CNHO dataset
multi_molecule: UV/Vis peak absorption wavelengths in different solvents
pcqm4mv2: HOMO-LUMO gaps of the PCQM4Mv2 dataset
timing: Timing benchmark using subsets of the QM9 gap
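To run several of these benchmarks one after another, a small wrapper like the sketch below might help. It assumes that each benchmark has a script named after it (only barriers_e2.sh is confirmed above; the other file names are an assumption), and run_benchmarks is a hypothetical helper, not something the repository provides.

```shell
#!/usr/bin/env bash
# Sketch only: run_benchmarks is a hypothetical helper.
# Assumption: every benchmark NAME has an executable script ./NAME.sh
# in the current folder (confirmed above only for barriers_e2).
run_benchmarks() {
    local failed=0
    local name script
    for name in "$@"; do
        script="./${name}.sh"
        if [ ! -x "$script" ]; then
            echo "skipping $name: $script not found or not executable" >&2
            failed=1
            continue
        fi
        echo "running $name"
        # Keep going if one benchmark fails, but remember the failure.
        "$script" || failed=1
    done
    return "$failed"
}
```

From inside the scripts folder, with the Chemprop environment activated, calling `run_benchmarks barriers_e2 barriers_sn2` would then run the two benchmarks in order and return a non-zero status if either script is missing or fails.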
The benchmarks were performed using Chemprop v2.0.3. To reproduce the exact environment used in this study, you can create a conda environment using the provided environment.yml file.
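Creating the environment from that file could look roughly like the sketch below. `conda env create -f` is the standard conda command for this; the wrapper function create_env is hypothetical and only adds two checks before calling it. The name of the resulting environment is defined inside environment.yml and is not stated here, so it is left out.

```shell
#!/usr/bin/env bash
# Sketch only: create_env is a hypothetical wrapper around "conda env create".
create_env() {
    local env_file="${1:-environment.yml}"
    if [ ! -f "$env_file" ]; then
        echo "environment file not found: $env_file" >&2
        return 1
    fi
    if ! command -v conda >/dev/null 2>&1; then
        echo "conda not found on PATH" >&2
        return 1
    fi
    # Creates the environment under the name given in the file's "name:" field.
    conda env create -f "$env_file"
}

# Run from the folder that contains environment.yml.
create_env environment.yml || echo "environment was not created"
```

After the environment has been created, activate it with `conda activate` followed by the name given in environment.yml.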