
Feature Request: Improve llama fit logic for asymmetric multi-GPU systems #18914

@jacekpoplawski

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Is it possible to automatically mark the RTX 3060 as a lower-priority device in the llama-fit logic? In the log below, the back-to-front dense-only fill places all 63 dense-only layers on the weakest GPU (CUDA3, an RTX 3060), and the final front-to-back pass still leaves it with 23 layers, 19 of them overflowing:

```
llama_params_fit_impl: projected memory use with initial parameters [MiB]:
llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce RTX 3090):  24124 total,  48454 used, -24608 free vs. target of   1024
llama_params_fit_impl:   - CUDA1 (NVIDIA GeForce RTX 3090):  24124 total,  44650 used, -20804 free vs. target of   1024
llama_params_fit_impl:   - CUDA2 (NVIDIA GeForce RTX 3090):  24124 total,  44650 used, -20804 free vs. target of   1024
llama_params_fit_impl:   - CUDA3 (NVIDIA GeForce RTX 3060):  11909 total,  18702 used,  -6929 free vs. target of   1024
llama_params_fit_impl: projected to use 156457 MiB of device memory vs. 83310 MiB of free device memory
llama_params_fit_impl: cannot meet free memory targets on all devices, need to use 77242 MiB less in total
llama_params_fit_impl: context size reduced from 196608 to 4096 -> need 50578 MiB less memory in total
llama_params_fit_impl: with only dense weights in device memory there is a total surplus of 74633 MiB
llama_params_fit_impl: filling dense-only layers back-to-front:
llama_params_fit_impl:   - CUDA3 (NVIDIA GeForce RTX 3060): 63 layers,   5133 MiB used,   6640 MiB free
llama_params_fit_impl:   - CUDA2 (NVIDIA GeForce RTX 3090):  0 layers,      0 MiB used,  23845 MiB free
llama_params_fit_impl:   - CUDA1 (NVIDIA GeForce RTX 3090):  0 layers,      0 MiB used,  23845 MiB free
llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce RTX 3090):  0 layers,   1293 MiB used,  22552 MiB free
llama_params_fit_impl: converting dense-only layers to full layers and filling them front-to-back with overflow to next device/system memory:
llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce RTX 3090): 13 layers ( 1 overflowing),  22746 MiB used,   1098 MiB free
llama_params_fit_impl:   - CUDA1 (NVIDIA GeForce RTX 3090): 14 layers ( 1 overflowing),  22603 MiB used,   1241 MiB free
llama_params_fit_impl:   - CUDA2 (NVIDIA GeForce RTX 3090): 13 layers ( 1 overflowing),  22397 MiB used,   1448 MiB free
llama_params_fit_impl:   - CUDA3 (NVIDIA GeForce RTX 3060): 23 layers (19 overflowing),  10443 MiB used,   1330 MiB free
```
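One possible shape for such a heuristic, shown as a minimal sketch rather than the actual `llama_params_fit_impl` code (the `Device` struct, the `priority` field, `fill_layers`, and all numbers below are hypothetical): order devices by a speed weight so that the slower RTX 3060 is only filled once the 3090s are full, instead of filling purely back-to-front by device index.

```cpp
// Hypothetical sketch of priority-aware layer placement. This is NOT the
// actual llama.cpp fit logic; struct fields, function names, and numbers
// are illustrative assumptions.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

struct Device {
    std::string name;
    int64_t free_mib;   // free memory after the fixed per-device target
    float   priority;   // e.g. relative bandwidth/FLOPS; higher = fill first
    int     layers = 0;
};

// Greedy fill: sort devices by priority so the weak card is only used
// once the faster cards are full.
static void fill_layers(std::vector<Device> & devs, int n_layers, int64_t mib_per_layer) {
    std::stable_sort(devs.begin(), devs.end(),
        [](const Device & a, const Device & b) { return a.priority > b.priority; });
    for (Device & d : devs) {
        while (n_layers > 0 && d.free_mib >= mib_per_layer) {
            d.free_mib -= mib_per_layer;
            d.layers++;
            n_layers--;
        }
    }
    // Anything left over would spill to system memory.
    if (n_layers > 0) {
        printf("%d layers overflow to system memory\n", n_layers);
    }
}

int main() {
    // Free-memory figures loosely modeled on the log above; priorities made up.
    std::vector<Device> devs = {
        { "CUDA0 (RTX 3090)", 23100, 1.0f },
        { "CUDA1 (RTX 3090)", 23100, 1.0f },
        { "CUDA2 (RTX 3090)", 23100, 1.0f },
        { "CUDA3 (RTX 3060)", 10885, 0.4f },  // weaker card gets lower priority
    };
    fill_layers(devs, 63, 1100);  // 63 layers, ~1100 MiB each (illustrative)
    for (const Device & d : devs) {
        printf("%-20s %2d layers, %6lld MiB free\n",
               d.name.c_str(), d.layers, (long long)d.free_mib);
    }
}
```

With these illustrative numbers, all 63 layers land on the three 3090s (21 each) and the 3060 stays empty, which is the placement this request is asking the fit logic to prefer.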

Motivation

Performance: the RTX 3060 is much slower than the RTX 3090s, so layers placed on it slow down inference; automatically deprioritizing it should improve overall throughput.

Possible Implementation

No response
