
Feature Request: Improve llama fit logic for asymmetric multi-GPU systems #18914

@jacekpoplawski

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Is it possible to automatically mark the RTX 3060 as a lower-priority device in the llama-fit logic? In the log below, the back-to-front dense-only fill places all 63 dense-only layers on the weakest GPU (CUDA3, an RTX 3060), and the final front-to-back pass still leaves it with 23 layers, 19 of them overflowing:

```
llama_params_fit_impl: projected memory use with initial parameters [MiB]:
llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce RTX 3090):  24124 total,  48454 used, -24608 free vs. target of   1024
llama_params_fit_impl:   - CUDA1 (NVIDIA GeForce RTX 3090):  24124 total,  44650 used, -20804 free vs. target of   1024
llama_params_fit_impl:   - CUDA2 (NVIDIA GeForce RTX 3090):  24124 total,  44650 used, -20804 free vs. target of   1024
llama_params_fit_impl:   - CUDA3 (NVIDIA GeForce RTX 3060):  11909 total,  18702 used,  -6929 free vs. target of   1024
llama_params_fit_impl: projected to use 156457 MiB of device memory vs. 83310 MiB of free device memory
llama_params_fit_impl: cannot meet free memory targets on all devices, need to use 77242 MiB less in total
llama_params_fit_impl: context size reduced from 196608 to 4096 -> need 50578 MiB less memory in total
llama_params_fit_impl: with only dense weights in device memory there is a total surplus of 74633 MiB
llama_params_fit_impl: filling dense-only layers back-to-front:
llama_params_fit_impl:   - CUDA3 (NVIDIA GeForce RTX 3060): 63 layers,   5133 MiB used,   6640 MiB free
llama_params_fit_impl:   - CUDA2 (NVIDIA GeForce RTX 3090):  0 layers,      0 MiB used,  23845 MiB free
llama_params_fit_impl:   - CUDA1 (NVIDIA GeForce RTX 3090):  0 layers,      0 MiB used,  23845 MiB free
llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce RTX 3090):  0 layers,   1293 MiB used,  22552 MiB free
llama_params_fit_impl: converting dense-only layers to full layers and filling them front-to-back with overflow to next device/system memory:
llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce RTX 3090): 13 layers ( 1 overflowing),  22746 MiB used,   1098 MiB free
llama_params_fit_impl:   - CUDA1 (NVIDIA GeForce RTX 3090): 14 layers ( 1 overflowing),  22603 MiB used,   1241 MiB free
llama_params_fit_impl:   - CUDA2 (NVIDIA GeForce RTX 3090): 13 layers ( 1 overflowing),  22397 MiB used,   1448 MiB free
llama_params_fit_impl:   - CUDA3 (NVIDIA GeForce RTX 3060): 23 layers (19 overflowing),  10443 MiB used,   1330 MiB free
```
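One possible shape for such a heuristic, shown as a minimal sketch rather than the actual `llama_params_fit_impl` code (the `Device` struct, the `priority` field, `fill_layers`, and all numbers below are hypothetical): order devices by a speed weight so that the slower RTX 3060 is only filled once the 3090s are full, instead of filling purely back-to-front by device index.

```cpp
// Hypothetical sketch of priority-aware layer placement. This is NOT the
// actual llama.cpp fit logic; struct fields, function names, and numbers
// are illustrative assumptions.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

struct Device {
    std::string name;
    int64_t free_mib;   // free memory after the fixed per-device target
    float   priority;   // e.g. relative bandwidth/FLOPS; higher = fill first
    int     layers = 0;
};

// Greedy fill: sort devices by priority so the weak card is only used
// once the faster cards are full.
static void fill_layers(std::vector<Device> & devs, int n_layers, int64_t mib_per_layer) {
    std::stable_sort(devs.begin(), devs.end(),
        [](const Device & a, const Device & b) { return a.priority > b.priority; });
    for (Device & d : devs) {
        while (n_layers > 0 && d.free_mib >= mib_per_layer) {
            d.free_mib -= mib_per_layer;
            d.layers++;
            n_layers--;
        }
    }
    // Anything left over would spill to system memory.
    if (n_layers > 0) {
        printf("%d layers overflow to system memory\n", n_layers);
    }
}

int main() {
    // Free-memory figures loosely modeled on the log above; priorities made up.
    std::vector<Device> devs = {
        { "CUDA0 (RTX 3090)", 23100, 1.0f },
        { "CUDA1 (RTX 3090)", 23100, 1.0f },
        { "CUDA2 (RTX 3090)", 23100, 1.0f },
        { "CUDA3 (RTX 3060)", 10885, 0.4f },  // weaker card gets lower priority
    };
    fill_layers(devs, 63, 1100);  // 63 layers, ~1100 MiB each (illustrative)
    for (const Device & d : devs) {
        printf("%-20s %2d layers, %6lld MiB free\n",
               d.name.c_str(), d.layers, (long long)d.free_mib);
    }
}
```

With these illustrative numbers, all 63 layers land on the three 3090s (21 each) and the 3060 stays empty, which is the placement this request is asking the fit logic to prefer.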

Motivation

Performance: the RTX 3060 is much slower than the RTX 3090s, so layers placed on it slow down inference; automatically deprioritizing it should improve overall throughput.

Possible Implementation

No response
