Clarification on Fine-Tuning with Special Tokens (e.g., <cas9>) in Evo Model #84
MarcAmil30 asked this question in Q&A (unanswered · 1 comment · 2 replies)
-
Hello,

In the paper, it states: "During pretraining, Evo uses an effective vocabulary of four tokens, one per base, from a total vocabulary of 512 characters. We use the additional characters to enable prompting with special tokens during generation with fine-tuned models."

I wanted to clarify the fine-tuning approach, particularly for the CRISPR-Cas application. Did you prepend prompt tokens like `<cas9>` or `<cas12>` to their respective sequences, e.g., `<cas9>ATAGCA...`? If so, were the prompt-token characters (e.g., `<`, `c`, `a`, `s`, `9`, `>`) already part of the 512-character vocabulary, or did you need to expand or retrain the model to include these new tokens?

I'm also curious whether the model would need to learn how to handle these characters during fine-tuning, since they were not part of the sequences used in pretraining.

Thank you!
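For concreteness, a quick sketch of the premise behind the question. It assumes a character-level tokenizer that uses code points directly as token IDs (an assumption; the thread does not specify Evo's tokenizer internals). Under that assumption, every printable ASCII character already fits in a 512-entry vocabulary, so no embedding-table expansion would be needed:

```python
# Assumption: token ID == character code point, consistent with the
# paper's 512-character vocabulary. Not confirmed Evo internals.
VOCAB_SIZE = 512

for ch in "<cas9>ACGT":
    token_id = ord(ch)  # code point used directly as the token ID
    assert token_id < VOCAB_SIZE, f"{ch!r} would require expanding the vocab"
    print(f"{ch!r} -> token {token_id}")
```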
-
We prepended a special token already in the vocab during fine-tuning. For Cas, these are:

    CAS_ID_TO_START_TOKEN = {
        'Cas9': '`',
        'Cas12': '!',
        'Cas13': '@',
    }

The model did not see these during pretraining, but did see them (only in the first character position) during fine-tuning!
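A minimal sketch of the data preparation this reply describes. The dictionary is quoted from the reply above; the helper function and the example sequence are illustrative, not the authors' code:

```python
# Start tokens quoted from the reply above: single characters that are
# already part of Evo's 512-character vocabulary.
CAS_ID_TO_START_TOKEN = {
    'Cas9': '`',
    'Cas12': '!',
    'Cas13': '@',
}

def tag_sequence(cas_id: str, sequence: str) -> str:
    """Prepend the family's start token so it occupies the first
    character position, as in the fine-tuning data described above."""
    return CAS_ID_TO_START_TOKEN[cas_id] + sequence

# Fine-tuning example (hypothetical sequence): '`' marks a Cas9 locus.
print(tag_sequence('Cas9', 'ATAGCA'))  # -> `ATAGCA
# At generation time, prompting the fine-tuned model with just '`'
# conditions it toward Cas9-like sequences.
```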