
Details in train.py output #58

Open
JonasLi-19 opened this issue Oct 6, 2023 · 21 comments
JonasLi-19 commented Oct 6, 2023

Dear professor and developers, I come up against a few questions about the procedures in training.

  1. I am not sure whether I should make the six train/test folds almost the same sample size, or whether every train:test split should be 8:2.

  2. After I train the model, one of the default2018*.out files contains: 0.801041 0.795815 0.558788 0.010000 1.746990 1.742589. I guess the first two metrics are sequentially max_AUC and min_AUC, but what are the other four numbers?

@RJ-Li

RJ-Li commented Oct 7, 2023

For the second question, you can find it in the paper:

When evaluating a test set without counterexamples, models trained with counterexamples perform worse than their counterparts trained without them at the pose selection task (0.885 to 0.845 AUC and 0.577 to 0.556 Top1), ... performance improvements in affinity prediction (0.577 to 0.587 Pearson's R and 1.463 to 1.457 RMSE)

@drewnutt

Dear professor and developers, I come up against a few questions about the procedures in training.

1. When I split my whole new dataset into train & validation, I use the default n_folds of '0,1,2': `train_0, train_1, train_2, test_0, test_1, test_2`. However, I am not sure whether I should make these six folds almost the same sample size, or should every train:test be 8:2?

2. After I train the model, one of the `default2018*.out` files contains: `0.801041 0.795815 0.558788 0.010000 1.746990 1.742589`. I guess the first two metrics are sequentially max_AUC and min_AUC, but I do not know what the following four numbers are.
  1. The folds used in the paper were created via clustering of the protein pockets via the ProBIS algorithm. The clusters were then randomly assigned to folds. So for each fold in 0,1, and 2 the testing set is composed of only 1 fold and the training set is the other 2 folds. Each fold is about the same size in terms of number of examples, so the train test split is ~2:1.

  2. The out file contains the following, in order: test_AUC, train_AUC, train_loss, learning_rate, test_RMSD_affinity, train_RMSD_affinity

    • If the model predicts the RMSD of a ligand, then there will be two more columns appended: test_RMSD_RMSE, train_RMSD_RMSE
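For convenience, the columns above can be pulled into a small dict. A minimal Python sketch (the column names come from the list above; `parse_out_line` is a hypothetical helper, not part of train.py):

```python
# Columns of a train.py *.out line, in the order listed above.
COLUMNS = [
    "test_AUC", "train_AUC", "train_loss",
    "learning_rate", "test_RMSD_affinity", "train_RMSD_affinity",
]
EXTRA = ["test_RMSD_RMSE", "train_RMSD_RMSE"]  # appended for RMSD-predicting models

def parse_out_line(line):
    values = [float(v) for v in line.split()]
    # Only attach the extra names when the line actually has 8 columns.
    names = COLUMNS + EXTRA[:len(values) - len(COLUMNS)]
    return dict(zip(names, values))

metrics = parse_out_line("0.801041 0.795815 0.558788 0.010000 1.746990 1.742589")
# metrics["test_AUC"] → 0.801041
```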

@drewnutt

I want to know more about the output RMSD; I didn't find an explanation of the meaning of the default2018.***.rmsd.test numbers in the README.

I am not sure what default2018.***.rmsd.test is referring to here.

train.py will output a *.rmsd.finaltest file at the end of training that lists the affinity label and prediction for each example in the training set, and the last line will show the RMSD of the affinity prediction on the test set.
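Given that layout, the file can be split into per-example pairs plus the summary value. A sketch under those assumptions (`parse_finaltest` is hypothetical, not part of train.py):

```python
# Split *.rmsd.finaltest content: every line but the last holds
# "<affinity_label> <prediction>"; the last line reports the overall RMSD.
def parse_finaltest(lines):
    rows = [ln.split() for ln in lines if ln.strip()]
    pairs = [(float(label), float(pred)) for label, pred in rows[:-1]]
    overall_rmsd = float(rows[-1][-1])  # last whitespace token of last line
    return pairs, overall_rmsd
```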

@JonasLi-19
Author

JonasLi-19 commented Nov 2, 2023

If I want to train default2018.model on the crystal dataset, which consists entirely of crystal ligands with RMSD=0, should I delete the RMSD column in the types file and set has_rmsd=false in the model?

@JonasLi-19
Author

I ask how to train on a crystal dataset because of the error I encountered:

I1102 07:12:28.331017 47833 layer_factory.hpp:77] Creating layer data
I1102 07:12:28.331130 47833 net.cpp:85] Creating Layer data
I1102 07:12:28.331151 47833 net.cpp:385] data -> data
I1102 07:12:28.331199 47833 net.cpp:385] data -> label
I1102 07:12:28.331220 47833 net.cpp:385] data -> affinity
I1102 07:12:28.331240 47833 net.cpp:385] data -> rmsd_true
Traceback (most recent call last):
  File "train.py", line 936, in <module>
    results = train_and_test_model(args, train_test_files[i], outname, cont)
  File "train.py", line 441, in train_and_test_model
    solver = caffe.get_solver(solverf)
ValueError: No valid examples found in training set.

My command is: python3 train.py -p crystal_2019_ -d /root/gnina_docker -m default2018.model --weights crossdock_default2018.caffemodel --dynamic

1 9.84 0 pdb2019_refi_train_gninatypes/2xef/2xef_rec.gninatypes pdb2019_refi_train_gninatypes/2xef/2xef_ligand.gninatypes
1 7.55 0 pdb2019_refi_train_gninatypes/3ni5/3ni5_rec.gninatypes pdb2019_refi_train_gninatypes/3ni5/3ni5_ligand.gninatypes
1 7.54 0 pdb2019_refi_train_gninatypes/4hu1/4hu1_rec.gninatypes pdb2019_refi_train_gninatypes/4hu1/4hu1_ligand.gninatypes
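As an aside, a types line like the ones above can be sanity-checked with a few lines of Python. A sketch assuming the five-column layout shown here (label, affinity, rmsd, receptor, ligand); `parse_types_line` is hypothetical, not gnina tooling:

```python
# Parse one gnina types line: label affinity rmsd rec_path lig_path
def parse_types_line(line):
    label, affinity, rmsd, rec, lig = line.split()
    return {
        "label": int(label),
        "affinity": float(affinity),
        "rmsd": float(rmsd),
        "rec": rec,
        "lig": lig,
    }

row = parse_types_line(
    "1 9.84 0 pdb2019_refi_train_gninatypes/2xef/2xef_rec.gninatypes "
    "pdb2019_refi_train_gninatypes/2xef/2xef_ligand.gninatypes"
)
# row["label"] → 1, row["rmsd"] → 0.0
```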

The rec and ligand gninatypes are at the correct paths.

I am pretty sure the files exist, because when I train other datasets with these crystal gninatypes, they don't raise an error!
So I guess: if I want to train default2018.model on the crystal dataset, which consists entirely of crystal ligands with RMSD=0, should I delete the RMSD column in the types file and set has_rmsd=false in the model?

@dkoes
Contributor

dkoes commented Nov 2, 2023

If you are using default2018 unaltered, it has balancing and stratification on by default, which makes no sense with an all-crystal dataset.
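For an all-crystal set (every example label 1), one option is to turn both off in the TRAIN-phase data layer. A sketch of just the relevant molgrid_data_param fields (the surrounding layer is unchanged; the exact edit is up to you):

```
molgrid_data_param {
    ...
    balanced: false            # no decoy poses, so class balancing cannot be satisfied
    stratify_receptor: false   # likewise, disable per-receptor stratification
    ...
}
```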

@SanFran-Me

Dear Professor, @dkoes
I trained crossdock_default2018 on the PDBbind2018 refined set, hoping it would be more powerful.
It does indeed perform better at scoring CASF-2016, but when I use the resulting caffemodel to redock the PDBbind2018 refined set, more than 70% of its output docked poses are above 2 Å (Top5 success rate around 30%).
What do you think of this result? I can only guess it is because the training set only contains label 1, so the CNNscore was trained to be worse?

@dkoes
Contributor

dkoes commented Nov 4, 2023

If you aren't training for pose selection, you won't get a model with good pose selection performance.

@Dadiao-shuai

Hi, I noticed that --avg_rotations evaluates the validation set over 24 rotations and averages the results. My question is: does gnina's train.py automatically use 24 rotations during training? If so, how can I increase the number of rotations?

@dkoes
Contributor

dkoes commented Nov 28, 2023

We train with random rotations enabled. There isn't a fixed number of rotations.

@Dadiao-shuai

What should I do when generating gninatypes files if I can't calculate the RMSD?
I'm using default2018.model and its caffemodel to train on a new dataset. The only relevant setting I've found to change is has_rmsd: true in default2018.model. Is there anything else to change?


layer {
  name: "data"
  type: "MolGridData"
  top: "data"
  top: "label"
  top: "affinity"
  top: "rmsd_true"
  include {
    phase: TEST
  }
  molgrid_data_param {
        source: "TESTFILE"
        batch_size: 50
        dimension: 23.5
        resolution: 0.500000
        shuffle: false
        ligmap: "completelig"
        recmap: "completerec"
        balanced: false
        has_affinity: true
        has_rmsd: true
        root_folder: "DATA_ROOT"
    }
  }
  
layer {
  name: "data"
  type: "MolGridData"
  top: "data"
  top: "label"
  top: "affinity"
  top: "rmsd_true"
  include {
    phase: TRAIN
  }
  molgrid_data_param {
        source: "TRAINFILE"
        batch_size:  50
        dimension: 23.5
        resolution: 0.500000
        shuffle: true
        balanced: true
        jitter: 0.000000
        ligmap: "completelig"
        recmap: "completerec"        
        stratify_receptor: true
        stratify_affinity_min: 0
        stratify_affinity_max: 0
        stratify_affinity_step: 1.000000
        has_affinity: true
        has_rmsd: true
        random_rotation: true
        random_translate: 6
        root_folder: "DATA_ROOT"       
    }
}
...
layer {
  name: "pose_output"
  type: "InnerProduct"
  bottom: "split"
  top: "pose_output"
  inner_product_param {
    num_output: 2
    weight_filler {
      type: "xavier"
    }
  }
}
...
layer {
  name: "output"
  type: "Softmax"
  bottom: "pose_output"
  top: "output"
}
layer {
  name: "labelout"
  type: "Split"
  bottom: "label"
  top: "labelout"
  include {
    phase: TEST
  }
}
layer {
  name: "rmsd"
  type: "AffinityLoss"
  bottom: "affinity_output"
  bottom: "affinity"
  top: "rmsd"
  affinity_loss_param {
    scale: 0.1
    gap: 0
    pseudohuber: false
    delta: 4
    penalty: 0
    ranklossmult: 0
    ranklossneg: 0    
  }
}
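On the RMSD itself: if the column can be computed at all, a plain symmetry-unaware RMSD over matched atoms is a few lines of numpy. A sketch (not gnina tooling; symmetry-aware tools such as Open Babel's obrms are preferable for real ligands):

```python
import numpy as np

# Plain RMSD between two conformers with identical atom ordering.
# This ignores molecular symmetry, so it can overestimate the true RMSD.
def rmsd(coords_a, coords_b):
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    assert a.shape == b.shape, "atom counts must match"
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

# Identical coordinates give 0.0, matching a crystal pose's RMSD column.
print(rmsd([[0, 0, 0], [1, 0, 0]], [[0, 0, 0], [1, 0, 0]]))  # 0.0
```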

@dkoes
Contributor

dkoes commented Dec 19, 2023

Don't include the field in the input file either. What error do you get when you try it?

@Dadiao-shuai

Thanks, I forgot to exclude the rmsd field in the types files.

@Dadiao-shuai

Failure when training on RMSD-free types files.

Error:

I1219 08:58:55.233186 52309 layer_factory.hpp:77] Creating layer data
I1219 08:58:55.233378 52309 net.cpp:85] Creating Layer data
I1219 08:58:55.233404 52309 net.cpp:385] data -> data
I1219 08:58:55.233462 52309 net.cpp:385] data -> label
I1219 08:58:55.233489 52309 net.cpp:385] data -> affinity
I1219 08:58:55.233507 52309 net.cpp:385] data -> rmsd_true
F1219 08:58:55.233546 52309 layer.hpp:372] Check failed: ExactNumTopBlobs() == top.size() (3 vs. 4) MolGridData Layer produces 3 top blob(s) as output.
*** Check failure stack trace: ***
Aborted (core dumped)

This is an example line from my train/test types files (label, affinity, rec, lig):
0 7.78 2020_m_pocket_gninatypes/5llo_m_pocket.gninatypes 2020_lig_mol2_gninatypes/5llo_ligand.gninatypes

This is the default2018.model; I set has_rmsd=false:

layer {
  name: "data"
  type: "MolGridData"
  top: "data"
  top: "label"
  top: "affinity"
  top: "rmsd_true"
  include {
    phase: TEST
  }
  molgrid_data_param {
        source: "TESTFILE"
        batch_size: 50
        dimension: 23.5
        resolution: 0.500000
        shuffle: false
        ligmap: "completelig"
        recmap: "completerec"
        balanced: false
        has_affinity: true
        has_rmsd: false
        root_folder: "DATA_ROOT"
    }
  }
  
layer {
  name: "data"
  type: "MolGridData"
  top: "data"
  top: "label"
  top: "affinity"
  top: "rmsd_true"
  include {
    phase: TRAIN
  }
  molgrid_data_param {
        source: "TRAINFILE"
        batch_size:  50
        dimension: 23.5
        resolution: 0.500000
        shuffle: true
        balanced: true
        jitter: 0.000000
        ligmap: "completelig"
        recmap: "completerec"        
        stratify_receptor: true
        stratify_affinity_min: 0
        stratify_affinity_max: 0
        stratify_affinity_step: 1.000000
        has_affinity: true
        has_rmsd: false
        random_rotation: true
        random_translate: 6
        root_folder: "DATA_ROOT"       
    }
} 
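For reference, the check failure above ("3 vs 4") means the data layers still declare four top blobs while has_rmsd: false makes MolGridData produce only three, so the top: "rmsd_true" line has to be removed as well. A sketch of the corrected TEST-phase layer (TRAIN is analogous):

```
layer {
  name: "data"
  type: "MolGridData"
  top: "data"
  top: "label"
  top: "affinity"        # no rmsd_true top when has_rmsd is false
  include {
    phase: TEST
  }
  molgrid_data_param {
        source: "TESTFILE"
        batch_size: 50
        dimension: 23.5
        resolution: 0.500000
        shuffle: false
        ligmap: "completelig"
        recmap: "completerec"
        balanced: false
        has_affinity: true
        has_rmsd: false
        root_folder: "DATA_ROOT"
    }
}
```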

@Dadiao-shuai

Dadiao-shuai commented Dec 19, 2023

Thanks for your suggestion, but it doesn't work.
I deleted the top: "rmsd_true" lines in the data layers (both TRAIN and TEST), then set has_rmsd: false.
Then I deleted the rmsd layer.
But it reported this error:

I1219 13:13:31.414115 52604 solver.cpp:57] Solver scaffolding done.
I1219 13:13:31.420727 52604 net.cpp:760] Ignoring source layer affinity_output_affinity_output_0_split
I1219 13:13:31.420799 52604 net.cpp:760] Ignoring source layer rmsd
I1219 13:13:31.425463 52604 solver.cpp:352] Iteration 0, Testing net (#0)
I1219 13:13:31.425810 52604 solver.cpp:352] Iteration 0, Testing net (#1)
Traceback (most recent call last):
  File "train.py", line 934, in <module>
    results = train_and_test_model(args, train_test_files[i], outname, cont)
  File "train.py", line 501, in train_and_test_model
    solver.step(test_interval)
ValueError: No valid stratified examples.

I am not sure what this error is.

My types files have four columns (label affinity rec_gninatypes lig_gninatypes) divided by spaces; I am sure there is no problem with them.
And I don't understand "Ignoring source layer affinity_output_affinity_output_0_split".

It occurs to me that my command is:
train.py -m default2018.model --weights crossdock_default2018.caffemodel -d /path/... -p ... --dynamic

So I think the caffemodel given in https://github.com/gnina/models/tree/master/crossdocked_paper is not compatible with an RMSD-free model? If so, I couldn't train the RMSD-free default2018.model because its corresponding crossdock_default2018.caffemodel is not given in gnina/models/acs2018.

@dkoes
Contributor

dkoes commented Dec 19, 2023

Presumably you have stratification on, but it isn't possible to fill the strata. Turn it off.

@SanFran-Me

If you aren't training for pose selection, you won't get a model with good pose selection performance.

So in gnina, if I want to train a model with good pose selection ability, I need to train on a dataset with poses of various RMSDs and pose labels?
And if I don't use the RMSD column in the types files during training, only the label (0/1), will the pose selection (CNNscore) perform worse?

@Dadiao-shuai

Dadiao-shuai commented Dec 24, 2023

So in gnina, if I want to train a model with good pose selection ability, I need to train on a dataset with poses of various RMSDs and pose labels? And if I don't use the RMSD column in the types files during training, only the label (0/1), will the pose selection (CNNscore) perform worse?

@SanFran-Me As I remember, the docked poses are ranked by CNNscore. I guess if you set CNNaffinity as the criterion for ranking the docked poses, things will be different.
By the way, I could not find how to set CNNaffinity as the ranking criterion for docked poses.

@dkoes
Contributor

dkoes commented Jan 10, 2024

The CNNscore is trained as a classifier; it does not use the RMSD, although successive work has indicated that would be a good idea.

@Dadiao-shuai

Hi, I'm wondering: if the RMSD doesn't affect the CNNscore classifier, what is its function? Only to help stratify the inputs?

@dkoes
Contributor

dkoes commented Jan 11, 2024

It wasn't used in the distributed models, but we've used it in training other models.
