Details of the affinity labels in data #26

Open
JonasLi-19 opened this issue Jul 22, 2023 · 12 comments

JonasLi-19 commented Jul 22, 2023

When using *_min poses as part of the training set:

I noticed that in the *.types files for PDBbind2016 you use *_min poses as part of the training data. How do you define their affinity label? Did you just assign those minimized poses the same affinity as the crystal poses, and set the other docked poses to the corresponding negative number?

Why does the second column have positive and negative numbers for the ligand and docked poses?

<label> <pK> <RMSD to crystal> <Receptor filename> <Ligand filename> # <Autodock Vina score>
1 3.28 0.908077 3zsx/3zsx_rec_0.gninatypes 3zsx/3zsx_min_0.gninatypes # -6.89469
0 -3.28 4.7514 3zsx/3zsx_rec_0.gninatypes 3zsx/3zsx_docked_0.gninatypes # -7.84082
0 -3.28 3.89599 3zsx/3zsx_rec_0.gninatypes 3zsx/3zsx_docked_1.gninatypes # -7.43202
0 -3.28 6.06622 3zsx/3zsx_rec_0.gninatypes 3zsx/3zsx_docked_2.gninatypes # -7.10783
0 -3.28 7.9518 3zsx/3zsx_rec_0.gninatypes 3zsx/3zsx_docked_3.gninatypes # -7.03943
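
For reference, a line in this format can be split into its columns like this (a minimal parsing sketch; the field names are just illustrative labels, not identifiers from the repository):

```python
def parse_types_line(line):
    """Split one .types line of the form shown above into its columns."""
    fields, _, vina = line.partition("#")
    label, pk, rmsd, rec, lig = fields.split()
    return {
        "label": int(label),        # pose label (the subject of this question)
        "pK": float(pk),            # affinity column, positive or negative
        "rmsd": float(rmsd),        # RMSD to the crystal ligand pose
        "receptor": rec,            # *_rec_*.gninatypes file
        "ligand": lig,              # *_min_* or *_docked_* .gninatypes file
        "vina": float(vina) if vina.strip() else None,  # AutoDock Vina score
    }
```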


dkoes commented Jul 24, 2023

We train with an L2 loss on "good" low RMSD (<2A) poses and a hinge loss on "bad" poses (as the predicted affinity should not be greater than the true affinity in this case, but it isn't reasonable to expect the correct affinity from a bad pose).

The negative value is used by our AffinityLoss function to identify that the hinge loss should be applied.
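
A minimal PyTorch-style sketch of this convention (an illustration only, not the actual gnina/Caffe AffinityLoss layer):

```python
import torch

def affinity_loss(pred, target):
    # target > 0: good (<2 A RMSD) pose, plain squared (L2) error.
    # target < 0: bad pose; |target| is the true affinity, and we only
    # penalize predictions that exceed it (hinge).
    good = target > 0
    l2 = (pred[good] - target[good]) ** 2
    hinge = torch.clamp(pred[~good] - target[~good].abs(), min=0) ** 2
    return torch.cat([l2, hinge]).mean()
```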

JonasLi-19 (Author) commented:

I am confused by the following statement about the dataset setup in the paper "Three-Dimensional Convolutional Neural Networks and a Cross-Docked Data Set for Structure-Based Drug Design", in the section "Training on Docked Poses Increases Pose Sensitivity".
It says: "Figure 2 shows the results of this analysis for the Def2018 model trained with either the Refined\Core Crystal set or the Refined\Core Docked set and tested on the Core set made up of docked poses."

From my reading, "the Core set made up of docked poses" means you use the trained Def2018 model to dock the 285 complexes in the Core set and generate numerous docked poses for each complex, NOT directly predict the affinity of the docked poses. Is that right?

[screenshot of Figure 2 from the paper]

JonasLi-19 (Author) commented:

I now understand that you use the hinge loss for all docked poses with >2 Å RMSD that were labeled negative. Are their CNNscores all >0.9, so that they are all counterexamples? Or do you simply use the hinge loss for every docked pose with >2 Å RMSD?

SanFran-Me commented:

In my opinion, the point of using a hinge loss for docked poses with RMSD > 2 Å is that it avoids over-penalizing bad samples, though I didn't find this stated in the paper.

By the way, if you only want to train the model to predict affinity, do you have to generate counterexamples? As the paper describes them, counterexamples are poses with CNNscore > 0.9 but RMSD > 2 Å, or CNNscore < 0.5 but RMSD < 2 Å, which has nothing to do with CNNaffinity. So I think you don't have to generate counterexamples if you only want to train on those docked poses to predict affinity. @JonasLi-19
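
(For concreteness, the counterexample condition quoted above could be written as the sketch below; the thresholds are the ones stated in this comment.)

```python
def is_counterexample(cnn_score, rmsd):
    # Confidently scored bad poses, or low-scored good poses,
    # per the definition paraphrased from the paper above.
    return (cnn_score > 0.9 and rmsd > 2.0) or (cnn_score < 0.5 and rmsd < 2.0)
```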


dkoes commented Aug 4, 2023

For training and evaluation in this paper, all poses are already docked - we do not use trained models to perform docking, only rescoring.

The hinge loss is only for affinity prediction because a bad pose should not have a good affinity.

Counter examples as described here are for training the pose scoring model.

JonasLi-19 (Author) commented:

> For training and evaluation in this paper, all poses are already docked - we do not use trained models to perform docking, only rescoring.
>
> The hinge loss is only for affinity prediction because a bad pose should not have a good affinity.
>
> Counter examples as described here are for training the pose scoring model.

Thanks, I see. You made the assumption that poses within 2 Å have nearly the same affinity as the crystal pose, so you just set their pK to the crystal pK.

I should have noticed this earlier, as the types file already says it all:
[screenshot of the types file]
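
In other words, the labeling rule seems to be (my own sketch of what the types file implies, not code from the repository):

```python
def label_pose(rmsd, crystal_pk, threshold=2.0):
    # Poses within the RMSD threshold inherit the crystal pK as-is;
    # poses beyond it get label 0 and a negated pK so the hinge loss applies.
    if rmsd < threshold:
        return 1, crystal_pk
    return 0, -crystal_pk
```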

JonasLi-19 (Author) commented:

Hi Professor @dkoes, I couldn't figure out why some of the crystal ligands have a negative pK. (You said the bad poses are negative, but why would crystal poses also be negative?)
data/PDBBind2016/Refined_types/ref2_crystal_train0.types
[screenshot of the types file showing crystal entries with negative pK]

I couldn't find the explanation in the paper, so I have to ask you here.

@SanFran-Me Thank you for your opinion. I agree that I don't need to worry about generating counterexamples if I only care about CNNaffinity accuracy.


dkoes commented Aug 7, 2023

That's a good question - @francoep ?


francoep commented Aug 7, 2023

A bug from file generation would be my guess. By definition they should all have label 1 and a non-negative pK.
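
A quick check one could run on the file (assuming the whitespace-separated column layout shown earlier in this thread; path taken from the comment above):

```python
# Flag crystal entries that violate the expected convention
# (label 1 and a positive pK).
with open("data/PDBBind2016/Refined_types/ref2_crystal_train0.types") as f:
    for lineno, line in enumerate(f, start=1):
        if not line.strip():
            continue
        label, pk = line.split()[:2]
        if int(label) != 1 or float(pk) <= 0:
            print(f"line {lineno}: unexpected crystal entry: {line.strip()}")
```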


francoep commented Aug 8, 2023

I uploaded a fixed version of the types file. I also re-ran the training of the Default 2018 model on the PDBbind crystal data, testing on the core set.

Paper-reported Def2018 model: RMSE 1.500325, Pearson R 0.734269 (Table 3).
Newly trained Def2018 model: RMSE 1.440790, Pearson R 0.764837.

This minor bump in performance does not change the general results reported in the CrossDocked2020 paper, which is that training on the general set with docked poses was better than training on the refined set's crystal poses.
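
(For reference, the two metrics quoted here are the usual root-mean-square error and Pearson correlation between predicted and experimental pK values, e.g.:)

```python
import numpy as np
from scipy.stats import pearsonr

def rmse_and_pearson(pred, true):
    pred, true = np.asarray(pred, dtype=float), np.asarray(true, dtype=float)
    rmse = np.sqrt(np.mean((pred - true) ** 2))
    r, _ = pearsonr(pred, true)
    return rmse, r
```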

SanFran-Me commented:

> I uploaded a fixed version of the types file. I also re-ran the training of the Default 2018 model on the PDBbind crystal data, testing on the core set.
>
> Paper-reported Def2018 model: RMSE 1.500325, Pearson R 0.734269 (Table 3). Newly trained Def2018 model: RMSE 1.440790, Pearson R 0.764837.
>
> This minor bump in performance does not change the general results reported in the CrossDocked2020 paper, which is that training on the general set with docked poses was better than training on the refined set's crystal poses.

It's great that you found the bug and improved your models, but how can I use your newly generated model file and weights file to run training.py? Where are they? Do I need to completely delete my current gnina and compile and install it from scratch again?


francoep commented Aug 9, 2023

The model file is unchanged (it only describes the network setup, which is just the default2018 architecture), and when training you generate your own weights files. This setup of training on PDBbind refined crystals is not part of gnina at all (and I would recommend against using such a training setup at all).
