Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Is it possible to automatically mark the RTX 3060 as a lower-priority device for llama-fit? Output from a current run for reference:
llama_params_fit_impl: projected memory use with initial parameters [MiB]:
llama_params_fit_impl: - CUDA0 (NVIDIA GeForce RTX 3090): 24124 total, 48454 used, -24608 free vs. target of 1024
llama_params_fit_impl: - CUDA1 (NVIDIA GeForce RTX 3090): 24124 total, 44650 used, -20804 free vs. target of 1024
llama_params_fit_impl: - CUDA2 (NVIDIA GeForce RTX 3090): 24124 total, 44650 used, -20804 free vs. target of 1024
llama_params_fit_impl: - CUDA3 (NVIDIA GeForce RTX 3060): 11909 total, 18702 used, -6929 free vs. target of 1024
llama_params_fit_impl: projected to use 156457 MiB of device memory vs. 83310 MiB of free device memory
llama_params_fit_impl: cannot meet free memory targets on all devices, need to use 77242 MiB less in total
llama_params_fit_impl: context size reduced from 196608 to 4096 -> need 50578 MiB less memory in total
llama_params_fit_impl: with only dense weights in device memory there is a total surplus of 74633 MiB
llama_params_fit_impl: filling dense-only layers back-to-front:
llama_params_fit_impl: - CUDA3 (NVIDIA GeForce RTX 3060): 63 layers, 5133 MiB used, 6640 MiB free
llama_params_fit_impl: - CUDA2 (NVIDIA GeForce RTX 3090): 0 layers, 0 MiB used, 23845 MiB free
llama_params_fit_impl: - CUDA1 (NVIDIA GeForce RTX 3090): 0 layers, 0 MiB used, 23845 MiB free
llama_params_fit_impl: - CUDA0 (NVIDIA GeForce RTX 3090): 0 layers, 1293 MiB used, 22552 MiB free
llama_params_fit_impl: converting dense-only layers to full layers and filling them front-to-back with overflow to next device/system memory:
llama_params_fit_impl: - CUDA0 (NVIDIA GeForce RTX 3090): 13 layers ( 1 overflowing), 22746 MiB used, 1098 MiB free
llama_params_fit_impl: - CUDA1 (NVIDIA GeForce RTX 3090): 14 layers ( 1 overflowing), 22603 MiB used, 1241 MiB free
llama_params_fit_impl: - CUDA2 (NVIDIA GeForce RTX 3090): 13 layers ( 1 overflowing), 22397 MiB used, 1448 MiB free
llama_params_fit_impl: - CUDA3 (NVIDIA GeForce RTX 3060): 23 layers (19 overflowing), 10443 MiB used, 1330 MiB free
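For reference, the log shows the two passes the fit logic performs: dense-only layers are assigned back-to-front, then converted to full layers and filled front-to-back, with overflow going to the next device or system memory. Below is a rough, hypothetical sketch of how a per-device priority could bias that ordering so a slower card like the 3060 receives full layers last; none of the names (fit_dev, priority, order_for_full_layer_fill) exist in llama.cpp, they are only there to illustrate the idea.

```cpp
// Hypothetical sketch only: none of these names exist in llama.cpp today.
// Idea: give each device a priority and sort by it before the front-to-back
// pass that converts dense-only layers into full layers, so a slower card
// (the RTX 3060 here) is filled last and ends up with fewer full layers.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

struct fit_dev {
    std::string name;      // e.g. "CUDA3 (NVIDIA GeForce RTX 3060)"
    int64_t     free_mib;  // memory still available for full layers
    int         priority;  // higher = fill earlier (hypothetical field)
};

// Order devices for the full-layer fill: by priority first, keeping the
// original order (stable sort) when priorities are equal, which preserves
// the current behaviour if no priorities are set.
static void order_for_full_layer_fill(std::vector<fit_dev> & devs) {
    std::stable_sort(devs.begin(), devs.end(),
        [](const fit_dev & a, const fit_dev & b) { return a.priority > b.priority; });
}

int main() {
    std::vector<fit_dev> devs = {
        {"CUDA0 (RTX 3090)", 23845, 1},
        {"CUDA1 (RTX 3090)", 23845, 1},
        {"CUDA2 (RTX 3090)", 23845, 1},
        {"CUDA3 (RTX 3060)",  6640, 0},  // lower priority -> filled last
    };
    order_for_full_layer_fill(devs);
    for (const auto & d : devs) {
        std::printf("%s (priority %d)\n", d.name.c_str(), d.priority);
    }
    return 0;
}
```

As far as I know, the closest manual workaround today is pinning a split by hand with --tensor-split, but that gives up the automatic fitting that llama-fit is meant to provide.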
Motivation
Performance: the RTX 3060 is considerably slower than the RTX 3090s, so giving it fewer layers should improve overall throughput.
Possible Implementation
No response