This repository contains the data and code for the work entitled "Interpretable predictions of SN2 kinetics using the BERT and RF architectures: Comparison to known reactivity rules".
Interactive TMAPs of the training data, the test data, and the accurate and inaccurate predictions are provided as .html files. Each .html file needs its companion .js file, which should be downloaded into the same directory as the .html file:
- accurate_predictions.js, Accurate_predictions_TMAP.html
- inaccurate_predictions.js, Inaccurate_predictions_TMAP.html
- test_distribution.js, Test_data_TMAP.html
- train_distribution.js, Train_data_TMAP.html
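Because each TMAP page depends on a .js file sitting next to it, a quick check of the download directory can catch a missing file before a page is opened. The following is a minimal sketch using only the Python standard library; the function name `missing_tmap_files` is hypothetical and not part of this repository, and the file names are copied from the list above.

```python
from pathlib import Path

# Pairs taken from the list above: .html file -> the .js file it needs alongside it.
TMAP_PAIRS = {
    "Accurate_predictions_TMAP.html": "accurate_predictions.js",
    "Inaccurate_predictions_TMAP.html": "inaccurate_predictions.js",
    "Test_data_TMAP.html": "test_distribution.js",
    "Train_data_TMAP.html": "train_distribution.js",
}


def missing_tmap_files(directory):
    """Return the sorted names of TMAP files that are absent from `directory`."""
    folder = Path(directory)
    expected = set(TMAP_PAIRS) | set(TMAP_PAIRS.values())
    return sorted(name for name in expected if not (folder / name).is_file())


if __name__ == "__main__":
    # Check the current working directory (an assumption; pass any path instead).
    missing = missing_tmap_files(".")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All TMAP files are present.")
```

Run the script from the directory that holds the downloaded files, or call `missing_tmap_files` with the path to that directory.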
This folder contains three subdirectories that hold the data and results for the DFT and ML calculations: "DFT", "ML_analysis_and_results", and "ML_datasets". The contents of each subdirectory are as follows:
"DFT" contains the following:
- Input and output files from DFT calculations that were executed using the autodE interface: "autodE_inps" and "autodE_outs"
- Input and output files from DFT calculations that were executed without using the autodE interface: "ORCA_SPECIES_CALCULATION_TYPE_inps" and "ORCA_SPECIES_CALCULATION_TYPE_outs"
- Output code from frequency shift calculations in OTherm: "OTherm_outs"
- The DFT dataset: "Data_for_dft.xlsx"
- Rotational symmetry numbers of species whose thermochemistry was calculated without using the autodE interface: "SPECIES_symmetry_numbers.csv"
- Summary of the main results from DFT calculations that were executed without using the autodE interface: "Iteration_of_DFT_outs.xlsx"
"ML_datasets" contains the input data for each of the trained ML models. "Input_data_DATASET_NAME.xlsx" is the raw dataset, with reactions in SMILES format, solvent names as text, and logk, temperature, solvent mole fractions, and ionic strength as floats. "Total_test_processed.xlsx" is the total test data, with the SMILES of each species in the reaction separated into its own column, and "Total_train_unstandardized_isida.xlsx" is the unstandardized ISIDA fragment representation of the total train data. Model inputs are provided in the following folders:
- Datasets in .rdf format for input into the RF training scripts: "rdfs"
- Cross-validation folds of datasets featurized for the RF model, generated by the RF training scripts: "RF_cv_splits"
- Cross-validation folds of datasets featurized for the BERT model: "BERT"
- Input data for the RF model trained on reaction center ISIDA fragments: "RF_rxn_center_only"
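To illustrate how a reaction SMILES can be separated into per-species entries, as in the processed test data, here is a minimal sketch. It assumes the common "reactants>agents>products" reaction SMILES layout with "." between species; the actual layout and column names used in the .xlsx files are not stated here, and `split_reaction_smiles` is a hypothetical helper, not a script from this repository.

```python
def split_reaction_smiles(reaction):
    """Split 'reactants>agents>products' into three lists of species SMILES."""
    parts = reaction.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected exactly two '>' separators, got: {reaction!r}")
    # Within each part, individual species are separated by '.'.
    reactants, agents, products = (
        [species for species in part.split(".") if species] for part in parts
    )
    return {"reactants": reactants, "agents": agents, "products": products}


# Illustrative SN2 reaction (bromomethane + hydroxide), not taken from the dataset.
example = split_reaction_smiles("CBr.[OH-]>>CO.[Br-]")
print(example)
# {'reactants': ['CBr', '[OH-]'], 'agents': [], 'products': ['CO', '[Br-]']}
```

Note that splitting on "." treats every dot-separated fragment as its own species, so a salt written as a single dotted entry would be split into its ions.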
"ML_analysis_and_results" contains three folders: "RF", "BERT", and "General".
"General" contains the data relevant to the analysis of both RF and BERT:
- Results from atom mapping calculations carried out on the test data: "Atom_mapping"
- Results from reaction center substructure matching to the representative examples of LG, steric, nucleophilic, and allylic effects: "Representative examples"
"RF" and "BERT" both have the following structure:
- Data and results from interpreting model predictions: "MODEL_interpretation"
- Data and results for representative examples of LG, steric, nucleophilic, and allylic effects: "MODEL_representative_examples"
- Results from training and evaluating the models: "MODEL_training_and_evaluation"
"RF" also contains the folder "Rxn_center_only", which contains the data and results from training, evaluating, and interpreting the RF model trained on reaction center ISIDA fragments.
This folder contains three subdirectories of Python scripts for training, evaluating, and interpreting the RF and BERT models, along with general utilities: "Training_and_evaluation", "Interpretation", and "Utils". The contents of each subdirectory are as follows:
- Python scripts for training and evaluating the BERT and RF models, and the RF model trained on reaction center ISIDA fragments: "BERT", "RF", and "RF_rxn_center_only", respectively
- Example Python script for plotting the learning curves
- Python scripts for interpreting predictions made by the BERT and RF models: "BERT" and "RF"
- Python scripts relevant to interpreting predictions made by both the BERT and RF models: "General"
- Example Python script for checking whether uncertainties associated with feature importances are uncorrelated