Description
I am trying to load the Phi-2 model using the Hugging Face runtime, but I am encountering an Out of Memory (OOM) error. The GPU I am using is a Tesla T4 with 16GB of memory. Interestingly, the same configuration works fine when using vLLM to load the model.
Environment
- GPU: Tesla T4 (16GB)
- Model: Phi-2
- Runtime: Hugging Face Transformers
- Comparison: vLLM works without issues under the same configuration.
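For reference, this is the back-of-the-envelope weight footprint I'm basing my expectations on (assuming Phi-2's roughly 2.7B parameters and counting weights only, not activations or the CUDA context):

```python
# Rough weight-only memory estimate for Phi-2 (~2.7B parameters).
# Activations, the KV cache and the CUDA context add several more GiB on top.
params = 2.7e9

print(f"fp32 weights: {params * 4 / 1024**3:.1f} GiB")  # ~10.1 GiB
print(f"fp16 weights: {params * 2 / 1024**3:.1f} GiB")  # ~5.0 GiB
```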
Error Details
Here is the specific error message I encountered:
2025-01-07 02:36:12,240 [mlserver.parallel] DEBUG - Starting response processing loop...
2025-01-07 02:36:12,241 [mlserver.rest] INFO - HTTP server running on http://0.0.0.0:8080
INFO: Started server process [18]
INFO: Waiting for application startup.
2025-01-07 02:36:12,297 [mlserver.metrics] INFO - Metrics server running on http://0.0.0.0:8082
2025-01-07 02:36:12,298 [mlserver.metrics] INFO - Prometheus scraping endpoint can be accessed on http://0.0.0.0:8082/metrics
INFO: Started server process [18]
INFO: Waiting for application startup.
2025-01-07 02:36:15.091931: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2025-01-07 02:36:15.156740: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-01-07 02:36:15.156806: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-01-07 02:36:15.160315: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-07 02:36:15.176303: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-01-07 02:36:16.599813: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/opt/conda/lib/python3.10/site-packages/diffusers/models/vq_model.py:20: FutureWarning: `VQEncoderOutput` is deprecated and will be removed in version 0.31. Importing `VQEncoderOutput` from `diffusers.models.vq_model` is deprecated and this will be removed in a future version. Please use `from diffusers.models.autoencoders.vq_model import VQEncoderOutput`, instead.
deprecate("VQEncoderOutput", "0.31", deprecation_message)
/opt/conda/lib/python3.10/site-packages/diffusers/models/vq_model.py:25: FutureWarning: `VQModel` is deprecated and will be removed in version 0.31. Importing `VQModel` from `diffusers.models.vq_model` is deprecated and this will be removed in a future version. Please use `from diffusers.models.autoencoders.vq_model import VQModel`, instead.
deprecate("VQModel", "0.31", deprecation_message)
INFO: Application startup complete.
2025-01-07 02:36:20,361 [mlserver.grpc] INFO - gRPC server running on http://0.0.0.0:8081
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
INFO: Uvicorn running on http://0.0.0.0:8082 (Press CTRL+C to quit)
2025-01-07 02:36:22.753498: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2025-01-07 02:36:22.812573: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-01-07 02:36:22.812633: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-01-07 02:36:22.814742: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-07 02:36:22.826697: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-01-07 02:36:24.104666: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/opt/conda/lib/python3.10/site-packages/diffusers/models/vq_model.py:20: FutureWarning: `VQEncoderOutput` is deprecated and will be removed in version 0.31. Importing `VQEncoderOutput` from `diffusers.models.vq_model` is deprecated and this will be removed in a future version. Please use `from diffusers.models.autoencoders.vq_model import VQEncoderOutput`, instead.
deprecate("VQEncoderOutput", "0.31", deprecation_message)
/opt/conda/lib/python3.10/site-packages/diffusers/models/vq_model.py:25: FutureWarning: `VQModel` is deprecated and will be removed in version 0.31. Importing `VQModel` from `diffusers.models.vq_model` is deprecated and this will be removed in a future version. Please use `from diffusers.models.autoencoders.vq_model import VQModel`, instead.
deprecate("VQModel", "0.31", deprecation_message)
2025-01-07 02:36:27,755 [mlserver][inferapi-phi-fy-mlserver] INFO - Loading model for pipeline task 'text-generation'...
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:49<00:00, 24.70s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:21<00:00, 10.51s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2025-01-07 02:37:46,426 [mlserver][inferapi-phi-fy-mlserver] INFO - Couldn't load model 'inferapi-phi-fy-mlserver'. Model will be removed from registry.
2025-01-07 02:37:46,426 [mlserver.parallel] ERROR - An error occurred processing a model update of type 'Load'.
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 167, in _load_model
model.ready = await model.load()
File "/opt/conda/lib/python3.10/site-packages/mlserver_huggingface/runtime.py", line 93, in load
self._model = load_pipeline_from_settings(self.hf_settings, self.settings)
File "/opt/conda/lib/python3.10/site-packages/mlserver_huggingface/common.py", line 69, in load_pipeline_from_settings
hf_pipeline = pipeline(
File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/__init__.py", line 1107, in pipeline
return pipeline_class(model=model, framework=framework, task=task, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 84, in __init__
super().__init__(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/base.py", line 874, in __init__
self.model.to(self.device)
File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2556, in to
return super().to(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
return self._apply(convert)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
[Previous line repeated 2 more times]
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
param_applied = fn(param)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB. GPU 0 has a total capacity of 14.57 GiB of which 10.75 MiB is free. Process 15003 has 14.55 GiB memory in use. Of the allocated memory 14.46 GiB is allocated by PyTorch, and 1.07 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/mlserver/parallel/worker.py", line 158, in _process_model_update
await self._model_registry.load(model_settings)
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 299, in load
return await self._models[model_settings.name].load(model_settings)
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 150, in load
await self._load_model(new_model)
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 175, in _load_model
await self._unload_model(model)
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 222, in _unload_model
model.ready = not await model.unload()
File "/opt/conda/lib/python3.10/site-packages/mlserver_huggingface/runtime.py", line 130, in unload
is_torch = self._model.framework == "pt"
AttributeError: 'HuggingFaceRuntime' object has no attribute '_model'
2025-01-07 02:37:46,845 [mlserver][inferapi-phi-fy-mlserver] INFO - Couldn't load model 'inferapi-phi-fy-mlserver'. Model will be removed from registry.
2025-01-07 02:37:46,853 [mlserver.parallel] ERROR - An error occurred processing a model update of type 'Unload'.
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/mlserver/parallel/worker.py", line 160, in _process_model_update
await self._model_registry.unload_version(
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 308, in unload_version
await model_registry.unload_version(version)
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 204, in unload_version
model = await self.get_model(version)
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 243, in get_model
raise ModelNotFound(self._name, version)
mlserver.errors.ModelNotFound: Model inferapi-phi-fy-mlserver not found
2025-01-07 02:37:46,855 [mlserver] ERROR - Some of the models failed to load during startup!
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 163, in _load_model
model = await callback(model)
File "/opt/conda/lib/python3.10/site-packages/mlserver/parallel/registry.py", line 171, in load_model
loaded = await pool.load_model(model)
File "/opt/conda/lib/python3.10/site-packages/mlserver/parallel/pool.py", line 171, in load_model
await self._dispatcher.dispatch_update(load_message)
File "/opt/conda/lib/python3.10/site-packages/mlserver/parallel/dispatcher.py", line 229, in dispatch_update
return await asyncio.gather(
File "/opt/conda/lib/python3.10/site-packages/mlserver/parallel/dispatcher.py", line 244, in dispatch_update_to_worker
return await self._async_responses.schedule_and_wait(worker_update, worker)
File "/opt/conda/lib/python3.10/site-packages/mlserver/parallel/dispatcher.py", line 59, in schedule_and_wait
return await self._wait(message_id)
File "/opt/conda/lib/python3.10/site-packages/mlserver/parallel/dispatcher.py", line 87, in _wait
response_message = await future
mlserver.parallel.errors.WorkerError: builtins.AttributeError: 'HuggingFaceRuntime' object has no attribute '_model'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/mlserver/server.py", line 125, in start
await asyncio.gather(
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 299, in load
return await self._models[model_settings.name].load(model_settings)
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 150, in load
await self._load_model(new_model)
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 175, in _load_model
await self._unload_model(model)
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 222, in _unload_model
model.ready = not await model.unload()
File "/opt/conda/lib/python3.10/site-packages/mlserver_huggingface/runtime.py", line 130, in unload
is_torch = self._model.framework == "pt"
AttributeError: 'HuggingFaceRuntime' object has no attribute '_model'
2025-01-07 02:37:46,860 [mlserver.parallel] INFO - Waiting for shutdown of default inference pool...
2025-01-07 02:37:51,748 [mlserver.parallel] INFO - Shutdown of default inference pool complete
2025-01-07 02:37:51,748 [mlserver.grpc] INFO - Waiting for gRPC server shutdown
2025-01-07 02:37:51,752 [mlserver.grpc] INFO - gRPC server shutdown complete
INFO: Shutting down
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [18]
INFO: Application shutdown complete.
INFO: Finished server process [18]
2025-01-07 02:37:51,950 [mlserver.parallel] INFO - Waiting for shutdown of default inference pool...
2025-01-07 02:37:51,951 [mlserver.parallel] INFO - Shutdown of default inference pool complete
2025-01-07 02:37:51,951 [mlserver.parallel] INFO - Waiting for shutdown of default inference pool...
2025-01-07 02:37:51,951 [mlserver.parallel] INFO - Shutdown of default inference pool complete
2025-01-07 02:37:51,951 [mlserver.parallel] INFO - Waiting for shutdown of default inference pool...
2025-01-07 02:37:51,951 [mlserver.parallel] INFO - Shutdown of default inference pool complete
Questions:
- According to my research, the Phi-2 model should fit within 16GB of GPU memory (especially with fp16 precision). Why does the Hugging Face runtime still result in an OOM error?
- Are there any recommended settings or best practices to reduce memory usage in the Hugging Face runtime for large models like Phi-2?
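For comparison, this is roughly the standalone transformers call I would expect the runtime to make in order to fit on a 16GB card; it is only a sketch, and the model id, dtype and device index are my assumptions rather than anything read from the runtime's code:

```python
# Standalone sketch (not MLServer): load Phi-2 in fp16 on a single GPU.
# A default pipeline() call keeps the weights in fp32, which needs roughly
# twice the memory of fp16 before any activations or KV cache.
import torch
from transformers import pipeline

generator = pipeline(
    task="text-generation",
    model="microsoft/phi-2",    # assumed model id
    torch_dtype=torch.float16,  # ~5 GiB of weights instead of ~10 GiB in fp32
    device=0,                   # the single Tesla T4
)

print(generator("Hello, my name is", max_new_tokens=20)[0]["generated_text"])
```

If there is a supported way to pass an equivalent dtype (or a device_map) through the runtime's model settings, that would largely answer the second question.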