Misc. bug: empty answer for "long" prompt #11176

Open
leonardogiacobbe opened this issue Jan 10, 2025 · 0 comments
Name and Version

llama.cpp version 167a515

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

libllama (core library)

Command line

No response

Problem description & steps to reproduce

I have a YARP (Yet Another Robot Platform) device that implements the ILLM YARP interface, which exposes these methods (a rough sketch of the device class follows the list):

  • setPrompt
  • readPrompt
  • ask
  • getConversation
  • deleteConversation
  • help
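For context, this is roughly how the device class declares them. The ask() signature matches the code shown later in this issue; the other signatures and the header paths are my approximation, not copied from YARP:

#include <string>
#include <vector>
#include <yarp/dev/ILLM.h>         // assumed header for the ILLM interface
#include <yarp/dev/LLM_Message.h>  // assumed header for LLM_Message

class Llama2Device : public yarp::dev::ILLM /* plus the usual YARP device base classes */
{
public:
    bool setPrompt(const std::string & prompt);
    bool readPrompt(std::string & oPrompt);
    bool ask(const std::string & question, yarp::dev::LLM_Message & oAnswer);
    bool getConversation(std::vector<yarp::dev::LLM_Message> & oConversation);
    bool deleteConversation();
    // help is also exposed; I am approximating its exact signature
};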

The device uses the llama.cpp library (version corresponding to commit 167a515) to ask questions to a .gguf model.
The answer generation is implemented inside the ask method and follows the same logic as the code in the "simple" example of llama.cpp.
Everything works perfectly until an input longer than 70 tokens is provided to the model; in that case the library immediately encounters an EOG token and returns an empty answer.
The prompt that creates this problem is:

You are an automated answerer. You can only answer yes or no. If a question cannot be answered with neither yes nor no, you will just answer Does not compute. Remember: think about the question a if it does not make sense to answer ith yes or no, answer Does not compute. Am I a human being?
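(For reference, the token count can be verified with the same two-pass llama_tokenize pattern used in the ask() code below; a minimal sketch, assuming model is already loaded and prompt holds the text above:)

// first pass with NULL returns the negated number of tokens needed
const int n_prompt = -llama_tokenize(model, prompt.c_str(), prompt.size(), NULL, 0, true, true);
std::vector<llama_token> tokens(n_prompt);
// second pass fills the buffer
llama_tokenize(model, prompt.c_str(), prompt.size(), tokens.data(), tokens.size(), true, true);
printf("prompt length: %d tokens\n", n_prompt);  // > 70 for the prompt above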

I tried running this prompt inside llama-simple with this command:

./llama-simple -m /home/leonardo/Repos/yarp-device-llama2/models/gemma/gemma-2b-it.gguf "Am I a human being? You are an automated answerer. You can only answer yes or no. If a question cannot be answered with neither yes nor no, you will just answer Does not compute. Remember: think about the question a if it does not make sense to answer ith yes or no, answer Does not compute."

The result is that it immediately replies with an empty answer too. That seemed fine, since the default context size is probably too small for a longer prompt.
So I found this issue and added the "-c 4096" flag at the end of the command:

./llama-simple -m /home/leonardo/Repos/yarp-device-llama2/models/gemma/gemma-2b-it.gguf "Am I a human being? You are an automated answerer. You can only answer yes or no. If a question cannot be answered with neither yes nor no, you will just answer Does not compute. Remember: think about the question a if it does not make sense to answer ith yes or no, answer Does not compute." -c 4096

And this time it computed and replied with a correct answer:

Sure, here's an answer to the question:

Does not compute.

Looking at the logs, these are the context values:

llama_new_context_with_model: n_ctx      = 128
llama_new_context_with_model: n_batch    = 74
llama_new_context_with_model: n_ubatch   = 74
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =     2.25 MiB
llama_new_context_with_model: KV self size  =    2.25 MiB, K (f16):    1.12 MiB, V (f16):    1.12 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.98 MiB
llama_new_context_with_model:        CPU compute buffer size =    72.88 MiB
llama_new_context_with_model: graph nodes  = 601
llama_new_context_with_model: graph splits = 1

These logs show n_ctx = 128, n_batch = 74, n_ubatch = 74.
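These values look consistent with the sizing formula from the "simple" example, which my ask() code below also uses; a short sketch of the arithmetic:

// n_ctx must hold the prompt plus every generated token;
// n_batch must hold the whole prompt for the single prompt-processing call
ctx_params.n_ctx   = n_prompt + n_predict - 1;  // 74 + 55 - 1 = 128 would match the log,
                                                // which would imply n_predict = 55 (my inference, not a verified value)
ctx_params.n_batch = n_prompt;                  // 74, matching n_batch above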
I then tried applying larger values for these parameters in my code so that it could handle a longer prompt.
The original code of my ask() method is:

bool Llama2Device::ask(const std::string &question, yarp::dev::LLM_Message &oAnswer)
{
   model_question = question;
   // if prompt is set, add it to the question
   if(prompt_set == true){
       model_question += " " + m_prompt.content;
   }

   // add question to the conversation
   yarp::dev::LLM_Message message;
   message.type = "user";
   message.content = question;
   m_conversation.push_back(message);

   // variable used to store the complete answer generated by the model
   std::string final_output;

   // tokenize the prompt
   // find the number of tokens in the prompt
   const int n_prompt = -llama_tokenize(model, model_question.c_str(), model_question.size(), NULL, 0, true, true);
   // allocate space for the tokens and tokenize the prompt
   std::vector<llama_token> prompt_tokens(n_prompt);
   if(llama_tokenize(model, model_question.c_str(), model_question.size(), prompt_tokens.data(), prompt_tokens.size(), true, true) < 0){
       yCError(LLAMA2DEVICE) << "Error: failed to tokenize the prompt";
       return false;
   }
   else{
       yCInfo(LLAMA2DEVICE) << "Prompt tokenized correctly";
   }

   // initialize the context
   llama_context_params ctx_params = llama_context_default_params();
   // n_ctx is the context size
   ctx_params.n_ctx = n_prompt + n_predict - 1;
   // n_batch is the maximum number of tokens that can be processed in a single call to llama_decode
   ctx_params.n_batch = n_prompt;
   // enable performance counters
   ctx_params.no_perf = false;

   llama_context * ctx = llama_new_context_with_model(model, ctx_params);

   // check if context has been initialized correctly
   if(ctx == NULL){
       yCError(LLAMA2DEVICE) << "Error: failed to create the llama_context";
       return false;
   }
   else{
       yCInfo(LLAMA2DEVICE) << "Context correctly initialized";
   }

   // initialize the sampler
   auto sparams = llama_sampler_chain_default_params();
   sparams.no_perf = false;
   llama_sampler * smpl = llama_sampler_chain_init(sparams);

   llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

   // convert each prompt token to a piece as a sanity check (the pieces are not printed here)
   for(auto id: prompt_tokens){
       char buf[128];
       int n = llama_token_to_piece(model, id, buf, sizeof(buf), 0, true);
       if(n < 0){
           yCError(LLAMA2DEVICE) << "Error: failed to convert token to piece";
           return false;
       }
   }

   // prepare a batch for the prompt
   llama_batch batch = llama_batch_get_one(prompt_tokens.data(), prompt_tokens.size());

   // main loop
   const auto t_main_start = ggml_time_us();
   int n_decode = 0;
   llama_token new_token_id;

   for (int n_pos = 0; n_pos + batch.n_tokens < n_prompt + n_predict;) {
       // evaluate the current batch with the transformer model
       if(llama_decode(ctx, batch)){
           yCError(LLAMA2DEVICE) << "Error: failed to eval";
           return false;
       }

       n_pos += batch.n_tokens;

       // sample the next token
       {
           new_token_id = llama_sampler_sample(smpl, ctx, -1);

           // check if it is the end of a generation
           if(llama_token_is_eog(model, new_token_id)){
               break;
           }

           char buf[128];
           int n = llama_token_to_piece(model, new_token_id, buf, sizeof(buf), 0, true);
           if(n < 0){
               yCError(LLAMA2DEVICE) << "Error: failed to convert token to piece";
               return false;
           }

           // accumulate tokens
           final_output += std::string(buf, n);

           // prepare the next batch with the sampled token
           batch = llama_batch_get_one(&new_token_id, 1);

           n_decode += 1;

           if (m_progress_bar == 1){
               // compute progress bar percentage
               int progress = static_cast<int>(100.0 * n_decode / n_predict);

               // print progress bar
               std::string progress_bar = "[";
               int bar_width = 50;  // progress bar length

               int pos = bar_width * progress / 100;
               for (int i = 0; i < bar_width; ++i) {
                   if (i < pos) progress_bar += "=";
                   else if (i == pos) progress_bar += ">";
                   else progress_bar += " ";
               }
               progress_bar += "] " + std::to_string(progress) + "%";

               yCInfo(LLAMA2DEVICE) << progress_bar;
           }
       }
   }

   yCInfo(LLAMA2DEVICE) << final_output;

   const auto t_main_end = ggml_time_us();

   // add model answer to the conversation 
   message.type = "assitant";
   message.content = final_output;
   m_conversation.push_back(message);

   // write model answer inside oAnswer
   oAnswer.type = "assistant";
   oAnswer.content = final_output;

   return true;
}
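(Side note on the code above: the context and sampler are created on every call but never released; a minimal cleanup sketch before the final return, using the standard llama.cpp free functions:)

// release per-call llama.cpp resources (currently omitted in ask())
llama_sampler_free(smpl);
llama_free(ctx);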

As written, the code returns an empty answer, so I forced the values of n_ctx, n_batch and n_ubatch like this:

ctx_params.n_ctx = 4096;
ctx_params.n_batch = 1024;
ctx_params.n_ubatch = 1024;
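To double-check what actually took effect, the applied values can also be read back right after creating the context (a hypothetical sanity check, not in my device code; llama_n_ctx, llama_n_batch and llama_n_ubatch are existing llama.cpp getters):

llama_context * ctx = llama_new_context_with_model(model, ctx_params);
// print the values llama.cpp actually applied to the context
printf("applied: n_ctx = %u, n_batch = %u, n_ubatch = %u\n",
       llama_n_ctx(ctx), llama_n_batch(ctx), llama_n_ubatch(ctx));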

The problem is that the model still replies with an empty answer.
The methods are exposed over RPC; here is how I set the prompt and asked the question:

>>setPrompt "You are an automated answerer. You can only answer yes or no. If a question cannot be answered with neither yes nor no, you will just answer Does not compute. Remember: think about the question a if it does not make sense to answer ith yes or no, answer Does not compute."
Response: [ok]
>>ask "Am I a human being?"
Response: [ok] (assistant "" () ())

The log of the device is this:

[INFO] |yarp.llama2Device| Prompt tokenized correctly
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 1024
llama_new_context_with_model: n_ubatch   = 1024
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =    72.00 MiB
llama_new_context_with_model: KV self size  =   72.00 MiB, K (f16):   36.00 MiB, V (f16):   36.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.98 MiB
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 3009.50 MiB
ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 0.00 MiB to 24.01 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  3009.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 601
llama_new_context_with_model: graph splits = 202
[INFO] |yarp.llama2Device| Context correctly initialized
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
[... the same line repeated 178 times ...]
[INFO] |yarp.llama2Device|

The warning "ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture" was also printed with a shorter prompt; in that case the device used the GPU (NVIDIA GeForce GTX 1060 6GB) and returned a non-empty answer.
So the problem appears only when the prompt gets a bit longer, and it persists even after defining a bigger context size.
I was wondering whether the "-c 4096" flag (which worked with llama-simple) modifies only the value of n_ctx, or whether it also modifies other parameters that I am not assigning.
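One more thing I could try (a hypothetical debugging addition, not yet in my code): log which token triggers the EOG check inside ask(), since this model has two EOG tokens per the metadata below:

// inside the sampling block, replacing the plain break on EOG
if(llama_token_is_eog(model, new_token_id)){
    // per the metadata below: 1 = '<eos>', 107 = '<end_of_turn>'
    yCInfo(LLAMA2DEVICE) << "generation stopped on EOG token id" << new_token_id;
    break;
}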

This is the information about the model I am using:

llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.name str              = gemma-2b-it
llama_model_loader: - kv   2:                       gemma.context_length u32              = 8192
llama_model_loader: - kv   3:                          gemma.block_count u32              = 18
llama_model_loader: - kv   4:                     gemma.embedding_length u32              = 2048
llama_model_loader: - kv   5:                  gemma.feed_forward_length u32              = 16384
llama_model_loader: - kv   6:                 gemma.attention.head_count u32              = 8
llama_model_loader: - kv   7:              gemma.attention.head_count_kv u32              = 1
llama_model_loader: - kv   8:                 gemma.attention.key_length u32              = 256
llama_model_loader: - kv   9:               gemma.attention.value_length u32              = 256
llama_model_loader: - kv  10:     gemma.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  14:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  15:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,256128]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,256128]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,256128]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - type  f32:  164 tensors
llm_load_vocab: control-looking token:    107 '<end_of_turn>' was not control-type; this is probably a bug in the model. its type will be overridden
llm_load_vocab: control token:      2 '<bos>' is not marked as EOG
llm_load_vocab: control token:      0 '<pad>' is not marked as EOG
llm_load_vocab: control token:      1 '<eos>' is not marked as EOG
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 5
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256128
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 18
llm_load_print_meta: n_head           = 8
llm_load_print_meta: n_head_kv        = 1
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 16384
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 2B
llm_load_print_meta: model ftype      = all F32 (guessed)
llm_load_print_meta: model params     = 2.51 B
llm_load_print_meta: model size       = 9.34 GiB (32.00 BPW)
llm_load_print_meta: general.name     = gemma-2b-it
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOG token        = 1 '<eos>'
llm_load_print_meta: EOG token        = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 93
llm_load_tensors: ggml ctx size =    0.08 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/19 layers to GPU
llm_load_tensors:        CPU buffer size =  9561.29 MiB

I honestly have no idea how to solve this problem.
Do you have any ideas about where it could be and how to fix it?

Thanks in advance

First Bad Commit

No response

Relevant log output

No response
