Clarification on Fine-Tuning with Special Tokens (e.g., <cas9>) in Evo Model #84
MarcAmil30 asked this question in Q&A (unanswered · 1 comment · 2 replies)
-
Hello,

In the paper, it states: "During pretraining, Evo uses an effective vocabulary of four tokens, one per base, from a total vocabulary of 512 characters. We use the additional characters to enable prompting with special tokens during generation with fine-tuned models."

I wanted to clarify the fine-tuning approach, particularly for the CRISPR-Cas application. Did you prepend prompt tokens like `<cas9>` or `<cas12>` to their respective sequences, e.g., `<cas9>ATAGCA...`? If so, were the prompt-token characters (e.g., `<`, `c`, `a`, `s`, `9`, `>`) already part of the 512-character vocabulary, or did you need to expand or retrain the model to include these new tokens?

I'm also curious whether the model would need to learn how to handle these characters during fine-tuning, since they were not part of the sequences used in pretraining.

Thank you!
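For concreteness, a quick sketch of the premise behind the question. It assumes a character-level tokenizer that uses code points directly as token IDs (an assumption; the thread does not specify Evo's tokenizer internals). Under that assumption, every printable ASCII character already fits in a 512-entry vocabulary, so no embedding-table expansion would be needed:

```python
# Assumption: token ID == character code point, consistent with the
# paper's 512-character vocabulary. Not confirmed Evo internals.
VOCAB_SIZE = 512

for ch in "<cas9>ACGT":
    token_id = ord(ch)  # code point used directly as the token ID
    assert token_id < VOCAB_SIZE, f"{ch!r} would require expanding the vocab"
    print(f"{ch!r} -> token {token_id}")
```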
-
We prepended a special token already in the vocab during fine-tuning. For Cas, these are:

    CAS_ID_TO_START_TOKEN = {
        'Cas9': '`',
        'Cas12': '!',
        'Cas13': '@',
    }

The model did not see these during pretraining, but did see them (only in the first character position) during fine-tuning!
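A minimal sketch of the data preparation this reply describes. The dictionary is quoted from the reply above; the helper function and the example sequence are illustrative, not the authors' code:

```python
# Start tokens quoted from the reply above: single characters that are
# already part of Evo's 512-character vocabulary.
CAS_ID_TO_START_TOKEN = {
    'Cas9': '`',
    'Cas12': '!',
    'Cas13': '@',
}

def tag_sequence(cas_id: str, sequence: str) -> str:
    """Prepend the family's start token so it occupies the first
    character position, as in the fine-tuning data described above."""
    return CAS_ID_TO_START_TOKEN[cas_id] + sequence

# Fine-tuning example (hypothetical sequence): '`' marks a Cas9 locus.
print(tag_sequence('Cas9', 'ATAGCA'))  # -> `ATAGCA
# At generation time, prompting the fine-tuned model with just '`'
# conditions it toward Cas9-like sequences.
```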