Description
I am trying to load the Phi-2 model using the Hugging Face runtime, but I am encountering an Out of Memory (OOM) error. The GPU I am using is a Tesla T4 with 16GB of memory. Interestingly, the same configuration works fine when using vLLM to load the model.
Environment
- GPU: Tesla T4 (16GB)
- Model: Phi-2
- Runtime: Hugging Face Transformers
- Comparison: vLLM works without issues under the same configuration.
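For reference, this is the back-of-the-envelope weight footprint I'm basing my expectations on (assuming Phi-2's roughly 2.7B parameters and counting weights only, not activations or the CUDA context):

```python
# Rough weight-only memory estimate for Phi-2 (~2.7B parameters).
# Activations, the KV cache and the CUDA context add several more GiB on top.
params = 2.7e9

print(f"fp32 weights: {params * 4 / 1024**3:.1f} GiB")  # ~10.1 GiB
print(f"fp16 weights: {params * 2 / 1024**3:.1f} GiB")  # ~5.0 GiB
```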
Error Details
Here is the specific error message I encountered:
2025-01-07 02:36:12,240 [mlserver.parallel] DEBUG - Starting response processing loop...
2025-01-07 02:36:12,241 [mlserver.rest] INFO - HTTP server running on http://0.0.0.0:8080
INFO: Started server process [18]
INFO: Waiting for application startup.
2025-01-07 02:36:12,297 [mlserver.metrics] INFO - Metrics server running on http://0.0.0.0:8082
2025-01-07 02:36:12,298 [mlserver.metrics] INFO - Prometheus scraping endpoint can be accessed on http://0.0.0.0:8082/metrics
INFO: Started server process [18]
INFO: Waiting for application startup.
2025-01-07 02:36:15.091931: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2025-01-07 02:36:15.156740: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-01-07 02:36:15.156806: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-01-07 02:36:15.160315: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-07 02:36:15.176303: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-01-07 02:36:16.599813: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/opt/conda/lib/python3.10/site-packages/diffusers/models/vq_model.py:20: FutureWarning: `VQEncoderOutput` is deprecated and will be removed in version 0.31. Importing `VQEncoderOutput` from `diffusers.models.vq_model` is deprecated and this will be removed in a future version. Please use `from diffusers.models.autoencoders.vq_model import VQEncoderOutput`, instead.
deprecate("VQEncoderOutput", "0.31", deprecation_message)
/opt/conda/lib/python3.10/site-packages/diffusers/models/vq_model.py:25: FutureWarning: `VQModel` is deprecated and will be removed in version 0.31. Importing `VQModel` from `diffusers.models.vq_model` is deprecated and this will be removed in a future version. Please use `from diffusers.models.autoencoders.vq_model import VQModel`, instead.
deprecate("VQModel", "0.31", deprecation_message)
INFO: Application startup complete.
2025-01-07 02:36:20,361 [mlserver.grpc] INFO - gRPC server running on http://0.0.0.0:8081
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
INFO: Uvicorn running on http://0.0.0.0:8082 (Press CTRL+C to quit)
2025-01-07 02:36:22.753498: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2025-01-07 02:36:22.812573: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-01-07 02:36:22.812633: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-01-07 02:36:22.814742: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-07 02:36:22.826697: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-01-07 02:36:24.104666: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/opt/conda/lib/python3.10/site-packages/diffusers/models/vq_model.py:20: FutureWarning: `VQEncoderOutput` is deprecated and will be removed in version 0.31. Importing `VQEncoderOutput` from `diffusers.models.vq_model` is deprecated and this will be removed in a future version. Please use `from diffusers.models.autoencoders.vq_model import VQEncoderOutput`, instead.
deprecate("VQEncoderOutput", "0.31", deprecation_message)
/opt/conda/lib/python3.10/site-packages/diffusers/models/vq_model.py:25: FutureWarning: `VQModel` is deprecated and will be removed in version 0.31. Importing `VQModel` from `diffusers.models.vq_model` is deprecated and this will be removed in a future version. Please use `from diffusers.models.autoencoders.vq_model import VQModel`, instead.
deprecate("VQModel", "0.31", deprecation_message)
2025-01-07 02:36:27,755 [mlserver][inferapi-phi-fy-mlserver] INFO - Loading model for pipeline task 'text-generation'...
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:49<00:00, 24.70s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:21<00:00, 10.51s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2025-01-07 02:37:46,426 [mlserver][inferapi-phi-fy-mlserver] INFO - Couldn't load model 'inferapi-phi-fy-mlserver'. Model will be removed from registry.
2025-01-07 02:37:46,426 [mlserver.parallel] ERROR - An error occurred processing a model update of type 'Load'.
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 167, in _load_model
model.ready = await model.load()
File "/opt/conda/lib/python3.10/site-packages/mlserver_huggingface/runtime.py", line 93, in load
self._model = load_pipeline_from_settings(self.hf_settings, self.settings)
File "/opt/conda/lib/python3.10/site-packages/mlserver_huggingface/common.py", line 69, in load_pipeline_from_settings
hf_pipeline = pipeline(
File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/__init__.py", line 1107, in pipeline
return pipeline_class(model=model, framework=framework, task=task, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 84, in __init__
super().__init__(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/base.py", line 874, in __init__
self.model.to(self.device)
File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2556, in to
return super().to(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
return self._apply(convert)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
[Previous line repeated 2 more times]
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
param_applied = fn(param)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB. GPU 0 has a total capacity of 14.57 GiB of which 10.75 MiB is free. Process 15003 has 14.55 GiB memory in use. Of the allocated memory 14.46 GiB is allocated by PyTorch, and 1.07 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/mlserver/parallel/worker.py", line 158, in _process_model_update
await self._model_registry.load(model_settings)
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 299, in load
return await self._models[model_settings.name].load(model_settings)
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 150, in load
await self._load_model(new_model)
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 175, in _load_model
await self._unload_model(model)
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 222, in _unload_model
model.ready = not await model.unload()
File "/opt/conda/lib/python3.10/site-packages/mlserver_huggingface/runtime.py", line 130, in unload
is_torch = self._model.framework == "pt"
AttributeError: 'HuggingFaceRuntime' object has no attribute '_model'
2025-01-07 02:37:46,845 [mlserver][inferapi-phi-fy-mlserver] INFO - Couldn't load model 'inferapi-phi-fy-mlserver'. Model will be removed from registry.
2025-01-07 02:37:46,853 [mlserver.parallel] ERROR - An error occurred processing a model update of type 'Unload'.
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/mlserver/parallel/worker.py", line 160, in _process_model_update
await self._model_registry.unload_version(
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 308, in unload_version
await model_registry.unload_version(version)
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 204, in unload_version
model = await self.get_model(version)
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 243, in get_model
raise ModelNotFound(self._name, version)
mlserver.errors.ModelNotFound: Model inferapi-phi-fy-mlserver not found
2025-01-07 02:37:46,855 [mlserver] ERROR - Some of the models failed to load during startup!
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 163, in _load_model
model = await callback(model)
File "/opt/conda/lib/python3.10/site-packages/mlserver/parallel/registry.py", line 171, in load_model
loaded = await pool.load_model(model)
File "/opt/conda/lib/python3.10/site-packages/mlserver/parallel/pool.py", line 171, in load_model
await self._dispatcher.dispatch_update(load_message)
File "/opt/conda/lib/python3.10/site-packages/mlserver/parallel/dispatcher.py", line 229, in dispatch_update
return await asyncio.gather(
File "/opt/conda/lib/python3.10/site-packages/mlserver/parallel/dispatcher.py", line 244, in dispatch_update_to_worker
return await self._async_responses.schedule_and_wait(worker_update, worker)
File "/opt/conda/lib/python3.10/site-packages/mlserver/parallel/dispatcher.py", line 59, in schedule_and_wait
return await self._wait(message_id)
File "/opt/conda/lib/python3.10/site-packages/mlserver/parallel/dispatcher.py", line 87, in _wait
response_message = await future
mlserver.parallel.errors.WorkerError: builtins.AttributeError: 'HuggingFaceRuntime' object has no attribute '_model'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/mlserver/server.py", line 125, in start
await asyncio.gather(
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 299, in load
return await self._models[model_settings.name].load(model_settings)
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 150, in load
await self._load_model(new_model)
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 175, in _load_model
await self._unload_model(model)
File "/opt/conda/lib/python3.10/site-packages/mlserver/registry.py", line 222, in _unload_model
model.ready = not await model.unload()
File "/opt/conda/lib/python3.10/site-packages/mlserver_huggingface/runtime.py", line 130, in unload
is_torch = self._model.framework == "pt"
AttributeError: 'HuggingFaceRuntime' object has no attribute '_model'
2025-01-07 02:37:46,860 [mlserver.parallel] INFO - Waiting for shutdown of default inference pool...
2025-01-07 02:37:51,748 [mlserver.parallel] INFO - Shutdown of default inference pool complete
2025-01-07 02:37:51,748 [mlserver.grpc] INFO - Waiting for gRPC server shutdown
2025-01-07 02:37:51,752 [mlserver.grpc] INFO - gRPC server shutdown complete
INFO: Shutting down
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [18]
INFO: Application shutdown complete.
INFO: Finished server process [18]
2025-01-07 02:37:51,950 [mlserver.parallel] INFO - Waiting for shutdown of default inference pool...
2025-01-07 02:37:51,951 [mlserver.parallel] INFO - Shutdown of default inference pool complete
2025-01-07 02:37:51,951 [mlserver.parallel] INFO - Waiting for shutdown of default inference pool...
2025-01-07 02:37:51,951 [mlserver.parallel] INFO - Shutdown of default inference pool complete
2025-01-07 02:37:51,951 [mlserver.parallel] INFO - Waiting for shutdown of default inference pool...
2025-01-07 02:37:51,951 [mlserver.parallel] INFO - Shutdown of default inference pool complete
Questions:
- According to my research, the Phi-2 model should fit within 16GB of GPU memory (especially with fp16 precision). Why does the Hugging Face runtime still result in an OOM error?
- Are there any recommended settings or best practices to reduce memory usage in the Hugging Face runtime for large models like Phi-2?
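For comparison, this is roughly the standalone transformers call I would expect the runtime to make in order to fit on a 16GB card; it is only a sketch, and the model id, dtype and device index are my assumptions rather than anything read from the runtime's code:

```python
# Standalone sketch (not MLServer): load Phi-2 in fp16 on a single GPU.
# A default pipeline() call keeps the weights in fp32, which needs roughly
# twice the memory of fp16 before any activations or KV cache.
import torch
from transformers import pipeline

generator = pipeline(
    task="text-generation",
    model="microsoft/phi-2",    # assumed model id
    torch_dtype=torch.float16,  # ~5 GiB of weights instead of ~10 GiB in fp32
    device=0,                   # the single Tesla T4
)

print(generator("Hello, my name is", max_new_tokens=20)[0]["generated_text"])
```

If there is a supported way to pass an equivalent dtype (or a device_map) through the runtime's model settings, that would largely answer the second question.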