Llamacpp with cpp backend #2527
Conversation
Signed-off-by: Shrinath Suresh <[email protected]>
Thanks for your contribution! Left a couple of comments.
It would be good to create a CMakeLists.txt in the llamacpp directory and use add_subdirectory() in the main file, to avoid the main one getting too crowded.
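A minimal sketch of what this suggestion could look like, assuming the target and variable names from the diff below (the exact source file name is hypothetical):

```cmake
# src/examples/llamacpp/CMakeLists.txt (hypothetical file layout)
set(LLM_SRC_DIR "${torchserve_cpp_SOURCE_DIR}/src/examples/llamacpp")
add_library(llamacpp_handler SHARED ${LLM_SRC_DIR}/llamacpp_handler.cc)
target_include_directories(llamacpp_handler PUBLIC ${LLM_SRC_DIR})
target_link_libraries(llamacpp_handler PRIVATE ts_backends_torch_scripted ts_utils ${TORCH_LIBRARIES})
```

The main CMakeLists.txt would then only need a single `add_subdirectory(llamacpp)` line instead of carrying all of the example's build logic.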
@@ -5,3 +5,26 @@ list(APPEND MNIST_SOURCE_FILES ${MNIST_SRC_DIR}/mnist_handler.cc)
add_library(mnist_handler SHARED ${MNIST_SOURCE_FILES})
target_include_directories(mnist_handler PUBLIC ${MNIST_SRC_DIR})
target_link_libraries(mnist_handler PRIVATE ts_backends_torch_scripted ts_utils ${TORCH_LIBRARIES})

set(LLM_SRC_DIR "${torchserve_cpp_SOURCE_DIR}/src/examples/llamacpp")
set(LLAMACPP_SRC_DIR "/home/ubuntu/llama.cpp")
Good to avoid absolute paths. Is this file included in the PR? What is the license of llama.cpp, and do we need to include the license file?
target_link_libraries(llamacpp_handler PRIVATE ts_backends_torch_scripted ts_utils ${TORCH_LIBRARIES})

set(MY_OBJECT_FILES
Where are the source files for these object files?
@@ -0,0 +1,5 @@
{
  "checkpoint_path" : "/home/ubuntu/llama-2-7b-chat.Q4_0.gguf"
Ditto. Also: how big is this file?
namespace llm {

void LlamacppHandler::initialize_context() {
  llama_ctx = llama_new_context_with_model(llamamodel, ctx_params);
Where is this defined?
    std::shared_ptr<torch::Device>& device,
    std::pair<std::string&, std::map<uint8_t, std::string>&>& idx_to_req_id,
    std::shared_ptr<torchserve::InferenceResponseBatch>& response_batch) {
  auto tokens_list_tensor = inputs[0].toTensor();
Can you implement processing of the whole batch? Processing the requests serially in a for loop would be fine for now; batched processing would be even better if possible with llama.cpp.
std::vector<llama_token> tokens_list;

for (auto id : long_vector) {
Why do we jump through so many loops here? Can't we write tokens_list directly from the tensor? Or can you create an array that uses the data_ptr as underlying storage, without making a copy?
}
const int n_gen = std::min(32, max_context_size);

while (llama_get_kv_cache_token_count(llama_ctx) < n_gen) {
Do I read this correctly that the maximum number of tokens (including context) will be 32?
llama_token new_token_id = 0;

auto logits = llama_get_logits(llama_ctx);
auto n_vocab = llama_n_vocab(llama_ctx);
It is good to avoid auto for primitive datatypes when readability does not suffer.
torch::Tensor stacked_tensor = torch::stack(tensor_vector);
llama_print_timings(llama_ctx);
llama_free(llama_ctx);
Can the model be reused? If so, this call should be moved into the destructor.
@mreso Thanks for your review comments. I have already addressed a few of your comments - implementing the destructor, batch processing, remove …
This feature was picked up in the v0.10.0 task.
Description
Benchmarking LLM deployment with CPP Backend
Setup and Test
1. Follow the instructions from README.md to set up the environment.
2. Download the TheBloke/Llama-2-7B-Chat-GGML model and update the path of the model in the script.
3. To control the number of tokens to be generated, update the max_context_size variable in the script to the desired number. Note: In the next version, this step will be changed to read the llm path from config.
4. Once the build is successful, the libllm_handler.so shared object file will be generated in the serve/cpp/./test/resources/torchscript_model/llm/llm_handler folder.
5. Copy the dummy.pt file to the llm_handler folder.
6. Move to the llm_handler folder and run the following command to generate the mar file. The default timeout is 120000. When the context size is 512, LLM generation takes more time to complete the request on a single-GPU machine.
7. Update prompt.txt if needed and run.
Type of change
Please delete options that are not relevant.
Feature/Issue validation/testing
Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.
Test A
Logs for Test A
Test B
Logs for Test B
Checklist: