Interpreting ML kinetics

This repository contains the data and code for the work entitled "Interpretable predictions of SN2 kinetics using the BERT and RF architectures: Comparison to known reactivity rules".

TMAPs

Interactive TMAPs of the training data, the test data, and the accurate and inaccurate predictions are provided as .html files. Each .html file needs its companion .js file, which should be downloaded into the same directory as the .html file:

accurate_predictions.js, Accurate_predictions_TMAP.html
inaccurate_predictions.js, Inaccurate_predictions_TMAP.html
test_distribution.js, Test_data_TMAP.html
train_distribution.js, Train_data_TMAP.html
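
Before opening a TMAP in a browser, the pairing above can be checked with a short script. This is a minimal sketch and not part of the repository: the function name `missing_tmap_files` is hypothetical, and only the file names are taken from the list above.

```python
from pathlib import Path

# Pairs taken from the list above: each .html TMAP and its companion .js file.
TMAP_PAIRS = {
    "Accurate_predictions_TMAP.html": "accurate_predictions.js",
    "Inaccurate_predictions_TMAP.html": "inaccurate_predictions.js",
    "Test_data_TMAP.html": "test_distribution.js",
    "Train_data_TMAP.html": "train_distribution.js",
}


def missing_tmap_files(directory):
    """Return the sorted names of TMAP files that are absent from `directory`.

    Hypothetical helper: it only checks that the files exist side by side,
    not that their contents are valid.
    """
    folder = Path(directory)
    expected = list(TMAP_PAIRS) + list(TMAP_PAIRS.values())
    return sorted(name for name in expected if not (folder / name).is_file())


if __name__ == "__main__":
    missing = missing_tmap_files(".")
    print("All TMAP files found" if not missing else f"Missing: {missing}")
```

An empty result means every .html file sits next to its .js file in the directory that was checked.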

Data

This folder contains three subdirectories holding the data and results of the DFT and ML calculations: "DFT", "ML_analysis_and_results", and "ML_datasets". Their contents are as follows:

DFT

Input and output files from DFT calculations that were executed using the autodE interface: "autodE_inps" and "autodE_outs"
Input and output files from DFT calculations that were executed without using the autodE interface: "ORCA_SPECIES_CALCULATION_TYPE_inps" and "ORCA_SPECIES_CALCULATION_TYPE_outs"
Output code from frequency shift calculations in OTherm: "OTherm_outs"
The DFT dataset: "Data_for_dft.xlsx"
Rotational symmetry numbers of species whose thermochemistry was calculated without using the autodE interface: "SPECIES_symmetry_numbers.csv"
Summary of the main results from DFT calculations that were executed without using the autodE interface: "Iteration_of_DFT_outs.xlsx"
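
For context on the rotational symmetry numbers listed above: in the standard rigid-rotor treatment, a symmetry number σ lowers the rotational entropy by R ln σ, which raises the free energy by RT ln σ. The sketch below illustrates that textbook relation only. How the authors applied the numbers in "SPECIES_symmetry_numbers.csv" is not described here, and the function names and the default temperature of 298.15 K are assumptions.

```python
import math

R = 8.314462618  # molar gas constant, J mol^-1 K^-1


def symmetry_entropy_correction(sigma):
    """Entropy term -R ln(sigma) in J mol^-1 K^-1, relative to sigma = 1."""
    if sigma < 1:
        raise ValueError("a rotational symmetry number must be >= 1")
    return -R * math.log(sigma)


def symmetry_free_energy_correction(sigma, temperature=298.15):
    """Free energy term +RT ln(sigma) in kJ mol^-1 (temperature is assumed)."""
    return -temperature * symmetry_entropy_correction(sigma) / 1000.0


if __name__ == "__main__":
    # Illustrative symmetry numbers only, not values from the repository's CSV.
    for sigma in (1, 2, 3, 12):
        dg = symmetry_free_energy_correction(sigma)
        print(f"sigma = {sigma:2d}: +{dg:.2f} kJ/mol")
```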

ML_datasets

This subdirectory contains the input data for each of the trained ML models. "Input_data_DATASET_NAME.xlsx" is the raw dataset with reactions in SMILES format, solvent names in text format, and logk, temperature, solvent mole fractions, and ionic strength in float format. "Total_test_processed.xlsx" is the total test data with SMILES of each species in the reaction separated into their own columns, and "Total_train_unstandardized_isida.xlsx" is the unstandardized ISIDA fragment representation of the total train data. Model inputs are provided in the following folders:

Datasets in .rdf format for input into the RF training scripts: "rdfs"
Cross-validation folds of datasets featurized for the RF model, generated by the RF training scripts: "RF_cv_splits"
Cross-validation folds of datasets featurized for the BERT model: "BERT"
Input data for the RF model trained on reaction center ISIDA fragments: "RF_rxn_center_only"
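
To illustrate what separating the SMILES of each species into their own columns involves, the sketch below splits a reaction SMILES string of the form `reactants>agents>products` into its components. It is a plain-Python illustration under stated assumptions, not the preprocessing code used to build "Total_test_processed.xlsx"; the function name `split_reaction_smiles` and the example reaction are hypothetical.

```python
def split_reaction_smiles(reaction):
    """Split 'reactants>agents>products' into lists of species SMILES.

    Species within each part are separated by '.', the SMILES
    disconnection symbol. Empty parts give empty lists.
    """
    parts = reaction.strip().split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators in a reaction SMILES")
    reactants, agents, products = (
        [species for species in part.split(".") if species] for part in parts
    )
    return {"reactants": reactants, "agents": agents, "products": products}


if __name__ == "__main__":
    # Hypothetical SN2 example: bromomethane + hydroxide -> methanol + bromide.
    print(split_reaction_smiles("CBr.[OH-]>>CO.[Br-]"))
```

Each list can then be written to its own column, for example one column per reactant and one per product.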

ML_analysis_and_results

This subdirectory contains three folders: "RF", "BERT", and "General".

"General" contains the data relevant to the analysis of both RF and BERT:

Results from atom mapping calculations carried out on the test data: "Atom_mapping"
Results from reaction center substructure matching to the representative examples of LG, steric, nucleophilic, and allylic effects: "Representative examples"

"RF" and "BERT" both have the following structure:

Data and results from interpreting model predictions: "MODEL_interpretation"
Data and results for representative examples of LG, steric, nucleophilic, and allylic effects: "MODEL_representative_examples"
Results from training and evaluating the models: "MODEL_training_and_evaluation"

"RF" also contains the folder "Rxn_center_only" which countains the data and results from training, evaluating, and interpreting the RF model trained on reaction center ISIDA fragments.

Scripts

This folder contains three subdirectories of Python scripts for training, evaluating, and interpreting the RF and BERT models, along with general utilities: "Training_and_evaluation", "Interpretation", and "Utils". Their contents are as follows:

Training and evaluation

Python scripts for training and evaluating the BERT and RF models, and the RF model trained on reaction center ISIDA fragments: "BERT", "RF", and "RF_rxn_center_only", respectively
Example Python script for plotting the learning curves

Interpretation

Python scripts for interpreting predictions made by the BERT and RF models: "BERT" and "RF"
Python scripts relevant to interpreting predictions made by both the BERT and RF models: "General"

Utils

Example Python script for checking whether uncertainties associated with feature importances are uncorrelated

duartegroup/InterpretingMLKinetics
