llama: Attempt to add ModernBert #14014
Conversation
The embedding result seems random and very low. There is something wrong with this.
Delete the files you added in models; we don't need them. Just make sure test-tokenizer-0 succeeds with the GGUF.
src/llama-model.cpp
```cpp
inpL = build_norm(inpL, model.tok_norm, nullptr, LLM_NORM, -1);
cb(inpL, "inp_norm", -1);

auto * inp_attn = build_attn_inp_kv_unified_iswa();
```
This should probably become:
```diff
-auto * inp_attn = build_attn_inp_kv_unified_iswa();
+auto * inp_attn = build_attn_inp_no_cache_iswa();
```
And add the corresponding mask logic in llama-graph. Pay special attention to how the SWA works for this model, i.e. whether it is symmetric or not:
```
# non-symmetric
token i attends to [i - n_swa, i]

# symmetric
token i attends to [i - n_swa/2, i + n_swa/2]
```
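As a concrete check of the two conventions, here is a minimal sketch (the helper names are illustrative, not llama.cpp API):
```cpp
#include <cstdint>

// Returns true if token i is allowed to attend to token j under each convention.
static bool swa_allowed_non_symmetric(int64_t i, int64_t j, int64_t n_swa) {
    // non-symmetric: token i attends to [i - n_swa, i]
    return j <= i && i - j <= n_swa;
}

static bool swa_allowed_symmetric(int64_t i, int64_t j, int64_t n_swa) {
    // symmetric: token i attends to [i - n_swa/2, i + n_swa/2]
    return j >= i - n_swa/2 && j <= i + n_swa/2;
}
```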
You have to add the new arch here (lines 13195 to 13203 in 5a8ae30):
```cpp
switch (arch) {
    case LLM_ARCH_BERT:
    case LLM_ARCH_JINA_BERT_V2:
    case LLM_ARCH_NOMIC_BERT:
    case LLM_ARCH_NOMIC_BERT_MOE:
    case LLM_ARCH_WAVTOKENIZER_DEC:
        {
            res = nullptr;
        } break;
```
To avoid creating a memory module (a.k.a. KV cache) for these models.
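For illustration, the extended switch could look like this; LLM_ARCH_MODERN_BERT is an assumed enum name and should match whatever identifier the PR actually introduces:
```cpp
switch (arch) {
    case LLM_ARCH_BERT:
    case LLM_ARCH_JINA_BERT_V2:
    case LLM_ARCH_NOMIC_BERT:
    case LLM_ARCH_NOMIC_BERT_MOE:
    case LLM_ARCH_MODERN_BERT: // assumed enum name for the new arch
    case LLM_ARCH_WAVTOKENIZER_DEC:
        {
            // no KV cache / memory module for these models
            res = nullptr;
        } break;
```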
So, since the vocab is BPE, you also need to add this: Line 1557 in 9f47fa5
Set the correct attribute on the [MASK] token, similarly to this (Lines 2097 to 2105 in 9f47fa5):
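A hedged sketch of what that could look like, assuming the referenced lines use the same _contains_any / _set_token_attr helpers as the other model-specific attribute fixes in llama-vocab.cpp, and that LSTRIP is the attribute wanted here; the "modern-bert" match string is also an assumption:
```cpp
// mark [MASK] so a leading space is stripped, mirroring the <mask> handling
// used for the Jina BERT models (sketch, not the PR's final code)
if (_contains_any(general_arch, {"modern-bert"})) {
    _set_token_attr("[MASK]", LLAMA_TOKEN_ATTR_LSTRIP, true);
}
```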
Yep, I also noticed the same.
@huydt84 Don't forget this ^, it's important.
Will dig into this tonight/this weekend...
Thank you! I have just added it.
Need to add a new enum value to llama_swa_type:
```cpp
LLAMA_SWA_TYPE_SYMMETRIC = 3,
```
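For context, the enum would then look roughly like this (the existing values are listed to the best of my reading of llama-hparams.h; only the SYMMETRIC entry is new):
```cpp
enum llama_swa_type {
    LLAMA_SWA_TYPE_NONE      = 0,
    LLAMA_SWA_TYPE_STANDARD  = 1,
    LLAMA_SWA_TYPE_CHUNKED   = 2,
    LLAMA_SWA_TYPE_SYMMETRIC = 3, // new: token i attends to [i - n_swa/2, i + n_swa/2]
};
```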
```cpp
inpL = build_norm(inpL, model.tok_norm, nullptr, LLM_NORM, -1);
cb(inpL, "inp_norm", -1);

auto * inp_attn = build_attn_inp_no_cache_iswa();
```
Since this is not an actual iSWA (interleaved SWA) model, we should simply use build_attn_inp_no_cache().
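So the line above would simply become (sketch; the surrounding graph code stays unchanged):
```cpp
// plain non-causal attention input: no KV cache, no interleaved-SWA bookkeeping;
// the sliding-window restriction is applied later when the KQ mask is built
auto * inp_attn = build_attn_inp_no_cache();
```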
```diff
@@ -241,6 +249,7 @@ class llm_graph_input_attn_no_cache : public llm_graph_input_i {
     const llama_hparams & hparams;
     const llama_cparams & cparams;
+    const int n_swa; // Sliding window attention size (0 = disabled)
```
This is already available from the hparams - no need to duplicate it here.
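A minimal sketch, assuming the field keeps its usual name n_swa in llama_hparams:
```cpp
// no duplicated member: read the window size from the existing hparams reference
// when building the mask (0 = sliding window disabled)
const int64_t n_swa = hparams.n_swa;
```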
```cpp
// Check if we're using sliding window attention
if (n_swa > 0) {
    const int64_t n_tokens     = ubatch->n_tokens;
    const int64_t n_seq_tokens = ubatch->n_seq_tokens;
```
This branch is actually non-causal attention + sliding window, so merge it with the existing implementation below.
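For illustration, a sketch of what the merged non-causal + sliding-window mask could look like, assuming a symmetric window and the usual 0.0f / -INFINITY mask values; the function name and layout are mine, not the PR's final code:
```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Non-causal attention restricted to a symmetric window (same-sequence checks omitted).
static void build_non_causal_swa_mask(std::vector<float> & kq_mask, int64_t n_tokens, uint32_t n_swa) {
    kq_mask.assign(n_tokens * n_tokens, -INFINITY);
    for (int64_t i = 0; i < n_tokens; ++i) {
        for (int64_t j = 0; j < n_tokens; ++j) {
            // n_swa == 0 disables the window; otherwise keep |i - j| <= n_swa/2
            const int64_t dist      = i > j ? i - j : j - i;
            const bool    in_window = n_swa == 0 || dist <= (int64_t) n_swa/2;
            if (in_window) {
                kq_mask[i*n_tokens + j] = 0.0f;
            }
        }
    }
}
```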
I don't know whether my implementation is correct or not.