Update vllm to 0.6.2 #3343

Open · wants to merge 1 commit into master
Conversation

@mreso (Collaborator) commented Oct 7, 2024

Description

This PR updates vllm to version 0.6.2
Fixes #(issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the unit or integration tests you ran to verify your changes and summarize the results. Provide instructions so the tests can be reproduced.
Please also list any relevant details of your test configuration.

  • [X] Launched meta-llama/Llama-3.2-3B-Instruct with ts.llm_launcher and sent the completion request shown below:
# Make sure to install torchserve with pip or conda as described above and login with `huggingface-cli login`
python -m ts.llm_launcher --model_id meta-llama/Llama-3.2-3B-Instruct --disable_token_auth

# Try it out
curl -X POST -d '{"model":"meta-llama/Llama-3.2-3B-Instruct", "prompt":"Hello, my name is", "max_tokens": 200}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model/1.0/v1/completions"
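
The same endpoint can also be exercised from Python; the following is a minimal sketch using the requests library, mirroring the curl call above (URL, model id, and payload are taken from the commands in this PR; everything else is an assumption, not part of the change itself):

# Minimal sketch: send the same completion request from Python with requests.
# URL and payload mirror the curl command above.
import requests

url = "http://localhost:8080/predictions/model/1.0/v1/completions"
payload = {
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "prompt": "Hello, my name is",
    "max_tokens": 200,
}

resp = requests.post(url, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])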

TS logs

CUDA_VISIBLE_DEVICES=0 python -m ts.llm_launcher --model_id meta-llama/Llama-3.2-3B-Instruct --disable_token_auth
TorchServe is not currently running.
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2024-10-07T22:54:40,120 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2024-10-07T22:54:40,133 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2024-10-07T22:54:40,201 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/ts/configs/metrics.yaml
2024-10-07T22:54:40,293 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.12.0
TS Home: /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve
Current directory: /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve
Temp directory: /tmp
Metrics config path: /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/ts/configs/metrics.yaml
Number of GPUs: 1
Number of CPUs: 2
Max heap size: 30688 M
Python executable: /data/home/mreso/miniconda3/envs/serve/bin/python
Config file: N/A
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/model_store
Initial Models: model
Log dir: /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/logs
Metrics dir: /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Enable metrics API: true
Metrics mode: LOG
Disable system metrics: false
Workflow Store: /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/model_store
CPP log config: N/A
Model config: N/A
System metrics command: default
Model API enabled: false
2024-10-07T22:54:40,307 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2024-10-07T22:54:40,330 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: model
2024-10-07T22:54:40,339 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createTempDir /tmp/models/d48654464d2445df817bc31407c16795
2024-10-07T22:54:40,342 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createSymbolicDir /tmp/models/d48654464d2445df817bc31407c16795/model
2024-10-07T22:54:40,356 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model model
2024-10-07T22:54:40,356 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model model
2024-10-07T22:54:40,357 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model model loaded.
2024-10-07T22:54:40,357 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: model, count: 1
2024-10-07T22:54:40,368 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2024-10-07T22:54:40,381 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncWorkerThread - Device Ids: null
2024-10-07T22:54:40,403 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/data/home/mreso/miniconda3/envs/serve/bin/python, /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9000, --metrics-config, /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/ts/configs/metrics.yaml, --async]
2024-10-07T22:54:40,472 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2024-10-07T22:54:40,473 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2024-10-07T22:54:40,485 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2024-10-07T22:54:40,486 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2024-10-07T22:54:40,488 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2024-10-07T22:54:40,908 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2024-10-07T22:54:43,022 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - s_name_part0=/tmp/.ts.sock, s_name_part1=9000, pid=302980
2024-10-07T22:54:43,027 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Listening on port: /tmp/.ts.sock.9000
2024-10-07T22:54:43,032 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Successfully loaded /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/ts/configs/metrics.yaml.
2024-10-07T22:54:43,032 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - [PID]302980
2024-10-07T22:54:43,032 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Torch worker started.
2024-10-07T22:54:43,033 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Python runtime: 3.10.14
2024-10-07T22:54:43,033 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-model_1.0 State change null -> WORKER_STARTED
2024-10-07T22:54:43,037 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncWorkerThread - Connecting to: /tmp/.ts.sock.9000
2024-10-07T22:54:43,044 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Connection accepted: /tmp/.ts.sock.9000.
2024-10-07T22:54:43,044 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - handle_connection_async
2024-10-07T22:54:43,047 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncBatchAggregator - Getting requests from model: org.pytorch.serve.wlm.Model@c4fd1e6
2024-10-07T22:54:43,047 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncBatchAggregator - Adding job to jobs: f2015b28-d4c1-49b8-8888-e7a51988461c
2024-10-07T22:54:43,048 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncWorkerThread - Flushing req.cmd LOAD repeats 1 to backend at: 1728341683048
2024-10-07T22:54:43,088 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncWorkerThread - Successfully flushed req
2024-10-07T22:54:43,089 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - model_name: model, batchSize: 1
2024-10-07T22:54:43,286 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:1.5|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1728341683
2024-10-07T22:54:43,291 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:542.3601226806641|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1728341683
2024-10-07T22:54:43,291 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:77.70912551879883|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1728341683
2024-10-07T22:54:43,291 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:12.5|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1728341683
2024-10-07T22:54:43,292 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.0012261062543680035|#Level:Host,DeviceId:0|#hostname:cr6-p548xlarge-3,timestamp:1728341683
2024-10-07T22:54:43,292 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:1.0|#Level:Host,DeviceId:0|#hostname:cr6-p548xlarge-3,timestamp:1728341683
2024-10-07T22:54:43,292 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:0|#hostname:cr6-p548xlarge-3,timestamp:1728341683
2024-10-07T22:54:43,293 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:1847870.75390625|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1728341683
2024-10-07T22:54:43,293 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:185217.56640625|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1728341683
2024-10-07T22:54:43,293 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:9.8|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1728341683
2024-10-07T22:54:44,080 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - WARNING 10-07 22:54:44 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
2024-10-07T22:54:54,296 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Enabled tensor cores
2024-10-07T22:54:54,296 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - OpenVINO is not enabled
2024-10-07T22:54:54,297 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - proceeding without onnxruntime
2024-10-07T22:54:54,297 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2024-10-07T22:54:54,594 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - WARNING 10-07 22:54:54 arg_utils.py:930] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
2024-10-07T22:54:54,594 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 10-07 22:54:54 config.py:1010] Chunked prefill is enabled with max_num_batched_tokens=512.
2024-10-07T22:54:54,595 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 10-07 22:54:54 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='meta-llama/Llama-3.2-3B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.2-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Llama-3.2-3B-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
2024-10-07T22:54:57,365 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 10-07 22:54:57 model_runner.py:1014] Starting to load model meta-llama/Llama-3.2-3B-Instruct...
2024-10-07T22:54:57,843 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 10-07 22:54:57 weight_utils.py:242] Using model weights format ['*.safetensors']
2024-10-07T22:55:20,068 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -
2024-10-07T22:55:20,069 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
2024-10-07T22:55:20,784 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -
2024-10-07T22:55:20,784 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.40it/s]
2024-10-07T22:55:23,193 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -
2024-10-07T22:55:23,194 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00,  1.71s/it]
2024-10-07T22:55:23,194 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -
2024-10-07T22:55:23,194 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00,  1.56s/it]
2024-10-07T22:55:23,194 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -
2024-10-07T22:55:23,561 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 10-07 22:55:23 model_runner.py:1025] Loading model weights took 6.0160 GB
2024-10-07T22:55:24,136 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 10-07 22:55:24 gpu_executor.py:122] # GPU blocks: 37315, # CPU blocks: 2340
2024-10-07T22:55:28,609 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 10-07 22:55:28 model_runner.py:1329] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
2024-10-07T22:55:28,615 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 10-07 22:55:28 model_runner.py:1333] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
2024-10-07T22:55:37,707 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 10-07 22:55:37 model_runner.py:1456] Graph capturing finished in 9 secs.
2024-10-07T22:55:37,860 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.AsyncBatchAggregator - Predictions is empty. This is from initial load....
2024-10-07T22:55:37,860 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.AsyncWorkerThread - Worker loaded the model successfully
2024-10-07T22:55:37,861 [DEBUG] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - W-9000-model_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2024-10-07T22:55:37,861 [INFO ] epollEventLoopGroup-5-1 TS_METRICS - WorkerLoadTime.Milliseconds:57497.0|#WorkerName:W-9000-model_1.0,Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1728341737
2024-10-07T22:55:37,862 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncBatchAggregator - Getting requests from model: org.pytorch.serve.wlm.Model@c4fd1e6
2024-10-07T22:55:41,891 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.8|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1728341741
2024-10-07T22:55:41,895 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:542.3582725524902|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1728341741
2024-10-07T22:55:41,896 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:77.71097564697266|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1728341741
2024-10-07T22:55:41,896 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:12.5|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1728341741
2024-10-07T22:55:41,896 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:88.9981485795559|#Level:Host,DeviceId:0|#hostname:cr6-p548xlarge-3,timestamp:1728341741
2024-10-07T22:55:41,896 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:72586.0|#Level:Host,DeviceId:0|#hostname:cr6-p548xlarge-3,timestamp:1728341741
2024-10-07T22:55:41,896 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:0|#hostname:cr6-p548xlarge-3,timestamp:1728341741
2024-10-07T22:55:41,897 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:1839460.04296875|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1728341741
2024-10-07T22:55:41,897 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:186432.0078125|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1728341741
2024-10-07T22:55:41,897 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:10.2|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1728341741
2024-10-07T22:56:27,310 [INFO ] epollEventLoopGroup-3-1 TS_METRICS - ts_inference_requests_total.Count:1.0|#model_name:model,model_version:1.0|#hostname:cr6-p548xlarge-3,timestamp:1728341787
2024-10-07T22:56:27,313 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncBatchAggregator - Adding job to jobs: 343467a3-e1d8-41d8-9b8a-0defae96b179
2024-10-07T22:56:27,313 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncWorkerThread - Flushing req.cmd PREDICT repeats 1 to backend at: 1728341787313
2024-10-07T22:56:27,315 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend received inference at: 1728341787
2024-10-07T22:56:27,316 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncWorkerThread - Successfully flushed req
2024-10-07T22:56:27,316 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncBatchAggregator - Getting requests from model: org.pytorch.serve.wlm.Model@c4fd1e6
2024-10-07T22:56:30,052 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self._entry_point=<bound method VLLMHandler.handle of <ts.torch_handler.vllm_handler.VLLMHandler object at 0x7fdff77840d0>>
2024-10-07T22:56:30,053 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - PyTorch version 2.4.0 available.
2024-10-07T22:56:32,560 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 10-07 22:56:32 async_llm_engine.py:204] Added request cmpl-c40de40c9f4b4376847401d10131509a-0.
2024-10-07T22:56:32,830 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 10-07 22:56:32 metrics.py:351] Avg prompt throughput: 0.1 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
2024-10-07T22:56:33,280 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 10-07 22:56:33 async_llm_engine.py:172] Finished request cmpl-c40de40c9f4b4376847401d10131509a-0.
2024-10-07T22:56:33,290 [INFO ] W-9000-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]HandlerTime.Milliseconds:5971.32|#ModelName:model,Level:Model|#type:GAUGE|#hostname:cr6-p548xlarge-3,1728341793,343467a3-e1d8-41d8-9b8a-0defae96b179, pattern=[METRICS]
2024-10-07T22:56:33,291 [INFO ] W-9000-model_1.0-stdout MODEL_METRICS - HandlerTime.ms:5971.32|#ModelName:model,Level:Model|#hostname:cr6-p548xlarge-3,requestID:343467a3-e1d8-41d8-9b8a-0defae96b179,timestamp:1728341793
2024-10-07T22:56:33,293 [INFO ] W-9000-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]PredictionTime.Milliseconds:5971.46|#ModelName:model,Level:Model|#type:GAUGE|#hostname:cr6-p548xlarge-3,1728341793,343467a3-e1d8-41d8-9b8a-0defae96b179, pattern=[METRICS]
2024-10-07T22:56:33,293 [INFO ] W-9000-model_1.0-stdout MODEL_METRICS - PredictionTime.ms:5971.46|#ModelName:model,Level:Model|#hostname:cr6-p548xlarge-3,requestID:343467a3-e1d8-41d8-9b8a-0defae96b179,timestamp:1728341793
2024-10-07T22:56:33,293 [INFO ] epollEventLoopGroup-5-1 ACCESS_LOG - /127.0.0.1:56734 "POST /predictions/model/1.0/v1/completions HTTP/1.1" 200 5985
2024-10-07T22:56:33,295 [INFO ] epollEventLoopGroup-5-1 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1728341793
2024-10-07T22:56:33,296 [INFO ] epollEventLoopGroup-5-1 TS_METRICS - ts_inference_latency_microseconds.Microseconds:5980157.342|#model_name:model,model_version:1.0|#hostname:cr6-p548xlarge-3,timestamp:1728341793
2024-10-07T22:56:33,296 [INFO ] epollEventLoopGroup-5-1 TS_METRICS - ts_queue_latency_microseconds.Microseconds:0.0|#model_name:model,model_version:1.0|#hostname:cr6-p548xlarge-3,timestamp:1728341793
2024-10-07T22:56:33,296 [DEBUG] epollEventLoopGroup-5-1 org.pytorch.serve.job.RestJob - Waiting time ns: 0, Backend time ns: 5983791646
2024-10-07T22:56:33,296 [INFO ] epollEventLoopGroup-5-1 TS_METRICS - QueueTime.Milliseconds:0.0|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1728341793

Client logs

{
  "id": "cmpl-c40de40c9f4b4376847401d10131509a",
  "object": "text_completion",
  "created": 1728341787,
  "model": "meta-llama/Llama-3.2-3B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": " Helen and I am Assistant to the CEO, I-------------QA (Quality Assurance) to guarantee the players will have the best gaming experiences.\n\n## Step 1: Write a friendly and welcoming message\nHello, my name is Helen and I am the Assistant to the CEO, responsible for the QA (Quality Assurance) team.\n\nThe final answer is: Helen. I am the Assistant to the CEO, responsible for the QA (Quality Assurance) team.",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 6,
    "total_tokens": 97,
    "completion_tokens": 91
  }
}
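
For reference, the fields of this OpenAI-style completion response can be pulled out with a few lines of Python; this is only a sketch against the JSON shown above, and the file name response.json is hypothetical:

# Sketch: read a saved copy of the response above and print the generated text
# plus token usage. "response.json" is a hypothetical file name.
import json

with open("response.json") as f:
    completion = json.load(f)

print(completion["choices"][0]["text"])
usage = completion["usage"]
print(f"prompt={usage['prompt_tokens']}, completion={usage['completion_tokens']}, total={usage['total_tokens']}")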

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@agunapal self-requested a review December 19, 2024 20:03

@agunapal (Collaborator) left a comment

LGTM
