Which llama.cpp modules do you know to be affected?
libllama (core library)
Command line
No response
Problem description & steps to reproduce
I have a YARP (Yet Another Robot Platform) device that implements the ILLM YARP interface, which exposes these methods:
setPrompt
readPrompt
ask
getConversation
deleteConversation
help
The device uses the llama.cpp library (the version corresponding to commit 167a515) to ask questions to a .gguf model.
The code responsible for answer generation is implemented inside the ask method and follows the same logic as the code in the "simple" example of llama.cpp.
Everything works perfectly until an input longer than 70 tokens is provided to the model. In that case the library immediately encounters an EOG token and returns an empty answer.
The prompt that creates this problem is:
You are an automated answerer. You can only answer yes or no. If a question cannot be answered with neither yes nor no, you will just answer Does not compute. Remember: think about the question a if it does not make sense to answer ith yes or no, answer Does not compute. Am I a human being?
I tried running this prompt inside llama-simple by using this command:
./llama-simple -m /home/leonardo/Repos/yarp-device-llama2/models/gemma/gemma-2b-it.gguf "Am I a human being? You are an automated answerer. You can only answer yes or no. If a question cannot be answered with neither yes nor no, you will just answer Does not compute. Remember: think about the question a if it does not make sense to answer ith yes or no, answer Does not compute."
The result is that it immediately replies with an empty answer too. This is understandable, since the default context size is probably too small to handle a bigger prompt.
So, I found this issue and added the flag "-c 4096" at the end of the command:
./llama-simple -m /home/leonardo/Repos/yarp-device-llama2/models/gemma/gemma-2b-it.gguf "Am I a human being? You are an automated answerer. You can only answer yes or no. If a question cannot be answered with neither yes nor no, you will just answer Does not compute. Remember: think about the question a if it does not make sense to answer ith yes or no, answer Does not compute." -c 4096
And it computed and replied with a correct answer:
Sure, here's an answer to the question:
Does not compute.
Looking at the logs, we can see that n_ctx = 128, n_batch = 74 and n_ubatch = 74.
I then thought of applying larger values for these parameters in my code so that I could handle a bigger prompt.
The original code of my ask() method is:
bool Llama2Device::ask(const std::string &question, yarp::dev::LLM_Message &oAnswer)
{
    model_question = question;

    // if a prompt is set, add it to the question
    if (prompt_set == true) {
        model_question += " " + m_prompt.content;
    }

    // add the question to the conversation
    yarp::dev::LLM_Message message;
    message.type = "user";
    message.content = question;
    m_conversation.push_back(message);

    // variable used to store the complete answer generated by the model
    std::string final_output;

    // tokenize the prompt:
    // first find the number of tokens in the prompt
    const int n_prompt = -llama_tokenize(model, model_question.c_str(), model_question.size(), NULL, 0, true, true);

    // allocate space for the tokens and tokenize the prompt
    std::vector<llama_token> prompt_tokens(n_prompt);
    if (llama_tokenize(model, model_question.c_str(), model_question.size(), prompt_tokens.data(), prompt_tokens.size(), true, true) < 0) {
        yCError(LLAMA2DEVICE) << "Error: failed to tokenize the prompt";
        return false;
    } else {
        yCInfo(LLAMA2DEVICE) << "Prompt tokenized correctly";
    }

    // initialize the context
    llama_context_params ctx_params = llama_context_default_params();
    // n_ctx is the context size
    ctx_params.n_ctx = n_prompt + n_predict - 1;
    // n_batch is the maximum number of tokens that can be processed in a single call to llama_decode
    ctx_params.n_batch = n_prompt;
    // enable performance counters
    ctx_params.no_perf = false;

    llama_context * ctx = llama_new_context_with_model(model, ctx_params);

    // check if the context has been initialized correctly
    if (ctx == NULL) {
        yCError(LLAMA2DEVICE) << "Error: failed to create the llama_context";
        return false;
    } else {
        yCInfo(LLAMA2DEVICE) << "Context correctly initialized";
    }

    // initialize the sampler
    auto sparams = llama_sampler_chain_default_params();
    sparams.no_perf = false;
    llama_sampler * smpl = llama_sampler_chain_init(sparams);
    llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

    // check that each prompt token can be converted to a text piece
    for (auto id : prompt_tokens) {
        char buf[128];
        int n = llama_token_to_piece(model, id, buf, sizeof(buf), 0, true);
        if (n < 0) {
            yCError(LLAMA2DEVICE) << "Error: failed to convert token to piece";
            return false;
        }
    }

    // prepare a batch for the prompt
    llama_batch batch = llama_batch_get_one(prompt_tokens.data(), prompt_tokens.size());

    // main loop
    const auto t_main_start = ggml_time_us();
    int n_decode = 0;
    llama_token new_token_id;

    for (int n_pos = 0; n_pos + batch.n_tokens < n_prompt + n_predict;) {
        // evaluate the current batch with the transformer model
        if (llama_decode(ctx, batch)) {
            yCError(LLAMA2DEVICE) << "Error: failed to eval";
            return false;
        }
        n_pos += batch.n_tokens;

        // sample the next token
        {
            new_token_id = llama_sampler_sample(smpl, ctx, -1);

            // check if it is the end of generation
            if (llama_token_is_eog(model, new_token_id)) {
                break;
            }

            char buf[128];
            int n = llama_token_to_piece(model, new_token_id, buf, sizeof(buf), 0, true);
            if (n < 0) {
                yCError(LLAMA2DEVICE) << "Error: failed to convert token to piece";
                return false;
            }
            // accumulate tokens
            final_output += std::string(buf, n);

            // prepare the next batch with the sampled token
            batch = llama_batch_get_one(&new_token_id, 1);
            n_decode += 1;

            if (m_progress_bar == 1) {
                // compute progress bar percentage
                int progress = static_cast<int>(100.0 * n_decode / n_predict);
                // print progress bar
                std::string progress_bar = "[";
                int bar_width = 50; // progress bar length
                int pos = bar_width * progress / 100;
                for (int i = 0; i < bar_width; ++i) {
                    if (i < pos) progress_bar += "=";
                    else if (i == pos) progress_bar += ">";
                    else progress_bar += " ";
                }
                progress_bar += "] " + std::to_string(progress) + "%";
                yCInfo(LLAMA2DEVICE) << progress_bar;
            }
        }
    }

    yCInfo(LLAMA2DEVICE) << final_output;
    const auto t_main_end = ggml_time_us();

    // add the model answer to the conversation
    message.type = "assistant";
    message.content = final_output;
    m_conversation.push_back(message);

    // write the model answer inside oAnswer
    oAnswer.type = "assistant";
    oAnswer.content = final_output;

    return true;
}
With this code I get an empty answer, so I forced the values of n_ctx, n_batch and n_ubatch in this way:
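A minimal sketch of that change, with the values reconstructed from the device log further below (n_ctx = 4096, n_batch = 1024, n_ubatch = 1024), so treat it as a sketch rather than the verbatim snippet:

// sketch of the forced context parameters; the values below are the ones
// reported by llama_new_context_with_model in the device log
llama_context_params ctx_params = llama_context_default_params();
ctx_params.n_ctx    = 4096;  // context size
ctx_params.n_batch  = 1024;  // max tokens per llama_decode call
ctx_params.n_ubatch = 1024;  // physical micro-batch size
ctx_params.no_perf  = false; // keep performance counters enabled
llama_context * ctx = llama_new_context_with_model(model, ctx_params);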
The problem is that the model still replies with an empty answer.
The methods are called over RPC; here is how I set the prompt and asked the question:
>>setPrompt "You are an automated answerer. You can only answer yes or no. If a question cannot be answered with neither yes nor no, you will just answer Does not compute. Remember: think about the question a if it does not make sense to answer ith yes or no, answer Does not compute."
Response: [ok]
>>ask "Am I a human being?"
Response: [ok] (assistant "" () ())
The log of the device is this:
[INFO] |yarp.llama2Device| Prompt tokenized correctly
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 1024
llama_new_context_with_model: n_ubatch = 1024
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 72.00 MiB
llama_new_context_with_model: KV self size = 72.00 MiB, K (f16): 36.00 MiB, V (f16): 36.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.98 MiB
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 3009.50 MiB
ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 0.00 MiB to 24.01 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 3009.50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 24.01 MiB
llama_new_context_with_model: graph nodes = 601
llama_new_context_with_model: graph splits = 202
[INFO] |yarp.llama2Device| Context correctly initialized
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
(the line above is repeated 178 times in the log)
[INFO] |yarp.llama2Device|
The warning "ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture" was also printed with a shorter prompt. In that case the device uses the GPU (NVIDIA GeForce GTX 1060 6GB) and returns an answer that is not empty.
So the problem I am encountering appears only when the prompt is a bit longer, and it persists even after defining a bigger context size.
I was wondering whether the "-c 4096" flag (which worked with llama-simple) modifies only the value of n_ctx, or whether it also modifies other parameters that I am not assigning.
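One check I could add (a sketch, assuming the llama_n_ctx / llama_n_batch / llama_n_ubatch getters from llama.h) is to print the values the context actually ends up with, right after creating it:

// print the effective context parameters right after llama_new_context_with_model,
// to see whether anything besides n_ctx differs from the llama-simple run
yCInfo(LLAMA2DEVICE) << "effective n_ctx ="  << (int) llama_n_ctx(ctx)
                     << "n_batch ="          << (int) llama_n_batch(ctx)
                     << "n_ubatch ="         << (int) llama_n_ubatch(ctx);

If those values already match the llama-simple run with -c 4096, then the difference is probably somewhere else.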
Here is the info about the model I am using:
I honestly have no idea how to solve this problem.
Do you have any ideas about where the problem could be and how to fix it?
Thanks in advance.
Name and Version
llama.cpp version 167a515
Operating systems
Linux
First Bad Commit
No response
Relevant log output
No response