Transfer Learning with Custom Empty Model + New Tokens #218

JanoschMenke · 2025-03-25T19:13:43Z

Hi,

I was wondering how to handle the standardization option when training a new prior from scratch, for example when I want to introduce new tokens.
When I create the new empty model, I need to turn off standardization; otherwise some of the SMILES will not be processed and their tokens will not be included in the vocabulary, right?
When I then run transfer learning to train the model, do I also need to turn off standardization, or will the standardization in TL take the vocabulary of the empty model into account?

Maybe I am overlooking something?

Best.
Janosch

halx · 2025-03-26T08:05:17Z

Maintainer

Hi Janosch,

The idea of standardization is to bring the input SMILES into, as the name says, a standard format, including validation for chemical validity according to RDKit's valence model. REINVENT will generate SMILES as trained in the model but will then determine validity using RDKit. So you will have to ensure that the input SMILES are in a form acceptable to RDKit. We also assume that scoring functions, in particular predictive models, are based on RDKit SMILES. We do have a facility to convert SMILES before passing them on, though: for example, the Lilly scoring components in contrib are based on their own chemoinformatics toolkit.

In practice, you need to either read in a pre-standardized SMILES file (set standardize to false) or use the built-in standardization facility. Also make sure to use the same file for both empty model creation and TL training, and set standardize to the same value in both. If you split the SMILES file into a training and a validation set, create the empty model on the combined set, i.e. do the split after creating the empty model, because tokens in the validation set may not be present in the training set.
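The "vocabulary first, split second" order can be sketched as follows. This is only an illustration: the regex tokenizer is a simplified stand-in and not REINVENT's own tokenizer, and the function names (`tokenize`, `build_vocabulary`, `split_after_vocabulary`, `missing_tokens`) are hypothetical helpers, not part of REINVENT.

```python
import random
import re

# Simplified SMILES tokenizer for illustration only (assumption):
# bracket atoms, Br, Cl and %NN ring closures are single tokens,
# everything else is split character by character.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into tokens."""
    return TOKEN_RE.findall(smiles)


def build_vocabulary(smiles_list):
    """Collect the set of all tokens that occur in the given SMILES."""
    vocab = set()
    for smi in smiles_list:
        vocab.update(tokenize(smi))
    return vocab


def split_after_vocabulary(smiles_list, valid_fraction=0.2, seed=0):
    """Build the vocabulary on the combined set, then split it."""
    vocab = build_vocabulary(smiles_list)  # combined set comes first
    shuffled = list(smiles_list)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return vocab, shuffled[n_valid:], shuffled[:n_valid]


def missing_tokens(smiles_list, vocab):
    """Tokens that occur in smiles_list but are absent from vocab."""
    return build_vocabulary(smiles_list) - vocab


combined = ["CCO", "c1ccccc1Br", "OB(O)c1ccccc1", "CC(=O)N"]
vocab, train, valid = split_after_vocabulary(combined, valid_fraction=0.25)
print(missing_tokens(valid, vocab))  # empty, whatever the split looks like
```

If the vocabulary were built from the training part only, `missing_tokens(valid, vocab)` could be non-empty whenever a rare token ends up only in the validation part.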

Having said that, we have started to implement a new data pipeline. The newest release of REINVENT has a new command, reinvent_datapre, which standardizes SMILES and writes out a SMILES file. This file would then be used as described above. We do not yet have a better solution for doing this in one go. You would also have to prepare any other SMILES input in the same way because we have not implemented that yet.

Cheers.
Hannes.

7 replies

JanoschMenke Mar 26, 2025
Author

Hi Hannes,

thanks for the detailed response. I prepared my dataset with the preprocess.py script,
so I assume that chemical validity as defined by RDKit is given.
However, when I build the empty model with standardize = True, I get errors for SMILES that are valid:
'default' filter c1(cc(cc2B(O)OC(c21)CN)OCC)Br is invalid.

Here I assume the issue is the atom types, which REINVENT does not allow by default?

For the rest I did follow the "best practices" of training the model.

JanoschMenke Mar 26, 2025
Author

Ah sorry, I also see warnings for molecules that should pass the default filter.

<WARN> "default" filter: N(CCCNc1c2c(nc3ccccc13)CCCC2)C(CCS)=O is invalid

halx Mar 26, 2025
Maintainer

I guess you mean preprocess.py from the data pipeline. In this case you need to set standardize = false for both empty model generation and TL. The built-in default filter has its own set of rules, e.g. your first SMILES contains a boron. It also has a few other quirky rules.
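A quick way to see which elements in a SMILES fall outside an allowed set is sketched below. Note the assumptions: the `allowed` set is a placeholder chosen for illustration and is not taken from REINVENT's filter code, the regex-based element extraction only covers common SMILES syntax, and `elements_in` and `disallowed_elements` are hypothetical helper names.

```python
import re

# Bracket atoms, two-letter halogens, organic-subset atoms, aromatic atoms.
ATOM_RE = re.compile(r"\[([^\]]+)\]|(Br|Cl)|([BCNOPSFI])|([bcnops])")
# Inside a bracket atom: optional isotope number, then the element symbol.
BRACKET_ELEMENT_RE = re.compile(r"^\d*([A-Z][a-z]?|[a-z]{1,2})")


def elements_in(smiles):
    """Return the set of element symbols found in a SMILES string."""
    found = set()
    for bracket, halogen, organic, aromatic in ATOM_RE.findall(smiles):
        if bracket:
            match = BRACKET_ELEMENT_RE.match(bracket)
            if match:
                found.add(match.group(1).capitalize())
        else:
            found.add((halogen or organic or aromatic).capitalize())
    return found


def disallowed_elements(smiles, allowed):
    """Elements present in the SMILES but missing from the allowed set."""
    return elements_in(smiles) - set(allowed)


# Placeholder element list (assumption, not REINVENT's actual rule set).
allowed = {"C", "N", "O", "F", "S", "Cl", "Br"}
print(disallowed_elements("c1(cc(cc2B(O)OC(c21)CN)OCC)Br", allowed))
print(disallowed_elements("N(CCCNc1c2c(nc3ccccc13)CCCC2)C(CCS)=O", allowed))
```

With this placeholder list, the first SMILES is flagged because of the boron, while the second one contains only C, N, O and S, so its rejection would have to come from one of the other rules.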

Answer selected by JanoschMenke

JanoschMenke Mar 26, 2025
Author

Yes, sorry, I was talking about preprocess.py from the data pipeline.
Awesome, that answers my questions. Thanks again for taking the time to help out.

Lastly, something I noticed in transfer learning: when you set standardize = False and randomize = True, the randomization can enumerate SMILES with tokens that have not been seen before and are hence not in the vocabulary. These are mainly ring-closure numbers (ring counters), where sometimes not enough rings are closed before new ones are opened.

In this case the training will fail. I guess one could introduce a flag that would force a re-randomization or skip this SMILES during training.
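A rough sketch of what I mean by "re-randomize or skip", written outside of REINVENT. Everything here is an assumption for illustration: the tokenizer is a simplified regex, `randomize_within_vocabulary` is a hypothetical helper, and `randomize` can be any function that maps a SMILES to a randomized SMILES (for example one built on RDKit's `Chem.MolToSmiles(mol, doRandom=True, canonical=False)`).

```python
import re

# Simplified SMILES tokenizer for illustration only (assumption).
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into tokens."""
    return TOKEN_RE.findall(smiles)


def in_vocabulary(smiles, vocabulary):
    """True if every token of the SMILES is in the vocabulary."""
    return all(token in vocabulary for token in tokenize(smiles))


def randomize_within_vocabulary(smiles, vocabulary, randomize, max_tries=10):
    """Return a randomized SMILES that uses only known tokens.

    Retries the randomization up to max_tries times. If no attempt stays
    inside the vocabulary, falls back to the input SMILES when it is
    covered by the vocabulary, and otherwise returns None so that the
    caller can skip the molecule.
    """
    for _ in range(max_tries):
        candidate = randomize(smiles)
        if in_vocabulary(candidate, vocabulary):
            return candidate
    if in_vocabulary(smiles, vocabulary):
        return smiles
    return None


# Toy randomizer: the first attempt uses ring number 3, the second does not.
attempts = iter(["c1cc3ccccc3cc1", "c1ccc2ccccc2c1"])
vocabulary = {"c", "1", "2"}
result = randomize_within_vocabulary(
    "c1ccc2ccccc2c1", vocabulary, lambda smi: next(attempts)
)
print(result)
```

Returning None for the skip case keeps the decision with the training loop, which could then drop the molecule from the batch or log it.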

halx Mar 26, 2025
Maintainer

Thanks for the hint. We have observed this before but never got around to looking into it. You say "mainly": what other unknown tokens have you seen?

JanoschMenke Mar 26, 2025
Author

Sorry, you are right. So far I have only noticed ring-closure numbers. I can't think of any other unseen token that would be created by randomization.

halx Mar 26, 2025
Maintainer

Many thanks, again!
