[Bug]: AMD Radeon RX 9070 XT (16GB) Not Utilizing GPU for SDXL (Persistent CPU Fallback) #608

Open
@barelyapersonphoto

Checklist

  • The issue exists after disabling all extensions
  • The issue exists on a clean installation of webui
  • The issue is caused by an extension, but I believe it is caused by a bug in the webui
  • The issue exists in the current version of the webui
  • The issue has not been reported before recently
  • The issue has been reported before but has not been fixed yet

What happened?

Environment:

  • GPU: AMD Radeon RX 9070 XT with 16GB GDDR6 VRAM
  • Operating System: Windows 11 (build 10.0.26100.4061, as reported in the console log)
  • AMD Driver Version: 25.5.1
  • Python Version: 3.10.6 (within WebUI venv)
  • Stable Diffusion WebUI DirectML Version: v1.10.1-amd-36-g679c645e

Problem Description:

Despite torch-directml successfully detecting my AMD GPU, the Stable Diffusion WebUI consistently falls back to CPU for SDXL image generation. The GPU remains at 0% utilization, while the CPU is pinned at 100% and system RAM usage is extremely high (15GB+ for SDXL). This results in excessively long generation times (e.g., 11-30 minutes for a 1024x1024 SDXL image).

The core issue appears to be a persistent failure during model loading related to copying tensors, specifically the VAE, to the GPU device.

Steps to Reproduce:

  1. Launch Stable Diffusion WebUI via webui-user.bat.
  2. Select an SDXL model (e.g., sd_xl_base_1.0.safetensors).
  3. Ensure a compatible sdxl_vae.safetensors is selected in the UI (placed in models/VAE).
  4. Set image resolution to 1024x1024.
  5. Enter any simple prompt (e.g., "a cat").
  6. Initiate image generation.

Expected Behavior:

The AMD Radeon RX 9070 XT GPU should be fully utilized (high 3D/Compute usage in Task Manager, significant VRAM utilization) to process the SDXL model, resulting in generation times significantly faster than CPU-only processing (typically 3-8 minutes for 1024x1024 SDXL on DirectML with similar hardware).

Observed Behavior:

During SDXL image generation (1024x1024):

  • GPU Usage (Task Manager Performance Tab - 3D/Compute): Consistently 0%.
  • Dedicated GPU Memory (VRAM): Remains very low (e.g., 1.9GB out of 16GB available).
  • CPU Usage: Pinned at 100%.
  • System RAM Usage: Extremely high (e.g., over 15GB).
  • Generation Time: Excessively long (e.g., 11 minutes for a 20-step 1024x1024 SDXL image, with estimates up to 30 minutes for other runs).
  • The final generated image is correct, indicating the VAE eventually functions despite the underlying loading issue.

Relevant Logs/Errors:

During initial model loading, the following errors were observed consistently in the console (these were from earlier attempts, but describe the underlying problem causing CPU fallback):

While copying the parameter named "first_stage_model.decoder.up.3.block.2.norm2.weight", whose dimensions in the model are torch.Size([512]) and whose dimensions in the checkpoint are torch.Size([512]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
... (many similar lines for first_stage_model.decoder parameters) ...
While copying the parameter named "first_stage_model.post_quant_conv.bias", whose dimensions in the model are torch.Size([4]) and whose dimensions in the checkpoint are torch.Size([4]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).

(Note: These "meta tensor" errors do not appear in every recent log, but the symptoms (CPU pinned, 0% GPU) suggest that the same failure to load model data onto the GPU may still be the root cause.)
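
Because the "meta tensor" errors do not show up in every run, it may help to check a saved console log for them rather than rely on memory. The following is a minimal sketch using only the Python standard library. The function name `meta_tensor_failures` and the idea of saving the console output to a text file are assumptions for illustration; they are not part of the WebUI.

```python
import re
from collections import Counter

# Matches the loader message quoted in this report and captures the parameter name.
META_ERROR = re.compile(
    r'While copying the parameter named "([^"]+)".*Cannot copy out of meta tensor'
)

def meta_tensor_failures(log_text: str) -> Counter:
    """Count failed parameters per top-level module (e.g. first_stage_model)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = META_ERROR.search(line)
        if match:
            counts[match.group(1).split(".")[0]] += 1
    return counts

# Two of the lines quoted above, plus one unrelated line.
sample = '''While copying the parameter named "first_stage_model.decoder.up.3.block.2.norm2.weight", whose dimensions in the model are torch.Size([512]) and whose dimensions in the checkpoint are torch.Size([512]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
While copying the parameter named "first_stage_model.post_quant_conv.bias", whose dimensions in the model are torch.Size([4]) and whose dimensions in the checkpoint are torch.Size([4]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
Model loaded in 18.7s'''

print(meta_tensor_failures(sample))  # Counter({'first_stage_model': 2})
```

An empty counter for a given run would mean the slow generation in that run cannot be attributed to these messages on log evidence alone.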

The following non-critical warning also appears during model loading, but it does not prevent the model from eventually loading:

Repository Not Found for url: https://huggingface.co/None/resolve/main/config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated. For more details, see https://huggingface.co/docs/huggingface_hub/authentication
Invalid username or password.
...
Failed to create model quickly; will retry using slow method. Model loaded in 18.7s

webui-user.bat Content:

@echo off

set HSA_OVERRIDE_GFX_VERSION=10.3.0
set TORCH_COMMAND=pip install torch-directml
git pull
set COMMANDLINE_ARGS=--autolaunch --skip-torch-cuda-test --no-half --no-half-vae
call webui.bat
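
To make it easier to compare launch configurations between runs, the flags in a batch file like the one above can be extracted programmatically. This is a rough sketch, not part of the WebUI: `commandline_args` is a hypothetical helper, it splits on whitespace only (so quoted values with spaces are not handled), and the flag name `--use-directml` is an assumption that does not come from this report and should be verified against the fork's README before drawing conclusions.

```python
def commandline_args(bat_text: str) -> list[str]:
    """Return the flags from the last `set COMMANDLINE_ARGS=...` line."""
    flags: list[str] = []
    for raw in bat_text.splitlines():
        line = raw.strip()
        if line.lower().startswith("set commandline_args="):
            # Everything after the first '=' is the argument string.
            flags = line.split("=", 1)[1].split()
    return flags

# The webui-user.bat content from this report.
bat = """@echo off
set HSA_OVERRIDE_GFX_VERSION=10.3.0
set TORCH_COMMAND=pip install torch-directml
git pull
set COMMANDLINE_ARGS=--autolaunch --skip-torch-cuda-test --no-half --no-half-vae
call webui.bat
"""

flags = commandline_args(bat)
wanted = "--use-directml"  # assumption: backend switch name, verify in the fork's README
print(flags)
print(f"{wanted} present: {wanted in flags}")
```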

Troubleshooting Performed (and their results):

  1. Initial Diagnosis: Identified extremely slow SDXL generation (20+ mins for 1024x1024) and suspected CPU fallback due to Task Manager observations.
  2. xformers Removal: Previously encountered NotImplementedError: No operator found for memory_efficient_attention_forward. Uninstalled xformers via pip uninstall xformers --yes.
    • Result: Eliminated the NotImplementedError, but CPU fallback for SDXL persisted.
  3. COMMANDLINE_ARGS Adjustments: Ensured webui-user.bat included --autolaunch --skip-torch-cuda-test --no-half. Later added --no-half-vae.
    • Result: No change in GPU utilization for SDXL.
  4. torch-directml Device Detection: Ran Python commands within venv:
    • import torch_directml
    • print(torch_directml.is_available()) -> Output: True
    • print(torch_directml.device_name(0)) -> Output: AMD Radeon RX 9070 XT
    • Result: Confirmed that torch-directml can detect and identify the GPU.
  5. SD 1.5 Test: Performed a generation with the built-in SD 1.5 model at 512x512.
    • Result: Generated a "perfect" image in approx. 2 minutes. While this is faster than SDXL, it's still slower than optimal for SD 1.5 on this hardware (expected 15-45s). CPU was still observed as highly active.
  6. Clean Reinstallation of PyTorch/DirectML:
    • Uninstalled torch, torchvision, torchaudio, torch-directml.
    • Reinstalled using the simplified command: pip install torch-directml.
    • Result: Installation completed successfully.
  7. SDXL Test after Reinstall: Performed a 1024x1024 SDXL generation using sd_xl_base_1.0.safetensors and sdxl_vae.safetensors (100MB version).
    • Result: No change. GPU remained at 0% utilization, CPU pinned, high system RAM usage. Generation time approx. 11 minutes.
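
The detection check in step 4 only shows that torch-directml can see the GPU, not that work actually runs on it. A slightly stronger check, sketched below, allocates a tensor on the DirectML device and performs a small matrix multiplication. `probe_directml` is a hypothetical helper name; the `torch_directml` calls used (`is_available`, `device`, `device_name`) are the same family as those in step 4. The script is meant to be run inside the WebUI venv and reports cleanly if the packages are missing.

```python
def probe_directml() -> dict:
    """Report whether a tensor can actually be allocated and used on DirectML."""
    try:
        import torch
        import torch_directml
    except ImportError as exc:
        # Packages not installed in this interpreter (e.g. venv not activated).
        return {"importable": False, "error": str(exc)}

    info = {"importable": True, "available": torch_directml.is_available()}
    if info["available"]:
        device = torch_directml.device(0)
        info["name"] = torch_directml.device_name(0)
        # A small matmul forces real work on the device, not just detection.
        a = torch.ones((64, 64), device=device)
        result = (a @ a).sum().item()
        info["tensor_device"] = str(a.device)
        # Each output element is 64, and there are 64 * 64 of them.
        info["matmul_ok"] = result == 64.0 * 64 * 64
    return info

print(probe_directml())
```

If this succeeds while SDXL generation still shows 0% GPU usage, that would point toward how the WebUI selects its device rather than toward the torch-directml installation itself.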

Steps to reproduce the problem

  1. Ensure you have the stable-diffusion-webui-directml repository cloned to your system.
  2. Set your webui-user.bat file to the following:
    @echo off
    
    set HSA_OVERRIDE_GFX_VERSION=10.3.0
    set TORCH_COMMAND=pip install torch-directml
    git pull
    set COMMANDLINE_ARGS=--autolaunch --skip-torch-cuda-test --no-half --no-half-vae
    call webui.bat
  3. Ensure torch-directml and its dependencies are correctly installed by manually running .\venv\Scripts\activate then pip install torch-directml in the stable-diffusion-webui-directml directory (after uninstalling previous versions if necessary).
  4. Download the official sd_xl_base_1.0.safetensors model and place it in models/Stable-diffusion/.
  5. Download the official sdxl_vae.safetensors and place it in models/VAE/ (create the folder if it doesn't exist).
  6. Double-click webui-user.bat to launch the WebUI.
  7. In the WebUI interface:
    • Select sd_xl_base_1.0.safetensors as the main Stable Diffusion checkpoint.
    • Explicitly select sdxl_vae.safetensors from the VAE dropdown.
    • Set the image resolution to 1024x1024.
    • Enter a simple prompt (e.g., "a cat").
  8. Click "Generate".
  9. While generation is in progress, open Windows Task Manager (Ctrl + Shift + Esc), navigate to the "Performance" tab, and observe your GPU (specifically "3D" or "Compute" graphs), CPU, and System RAM usage.

What should have happened?

The AMD Radeon RX 9070 XT GPU should be primarily utilized for the image generation process. During generation:

  • GPU Usage (Task Manager Performance Tab - 3D/Compute): Should show significant activity (e.g., 50% or higher utilization).
  • Dedicated GPU Memory (VRAM): Should show high utilization (e.g., above 10GB for 1024x1024 SDXL).
  • CPU Usage: Should be lower, handling coordination and data pre-processing, but not pinned at 100%.
  • System RAM Usage: Should be lower, as the model and tensors should reside in VRAM.
  • Generation Time: Should be significantly faster than CPU-only processing, typically completing a 20-step 1024x1024 SDXL image within 3-8 minutes on similar DirectML-enabled hardware.
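
For reference, the expectation above can be compared against the per-step speed reported in the console log (20 steps at 28.92 s/it). The short sketch below does that arithmetic; the helper names are illustrative only.

```python
def total_seconds(steps: int, seconds_per_it: float) -> float:
    """Total sampling time for a run at a constant per-step speed."""
    return steps * seconds_per_it

def as_min_sec(seconds: float) -> str:
    """Format seconds as M:SS, truncating fractions of a second."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"

# Observed: 20 steps at 28.92 s/it, taken from the console log.
observed = total_seconds(20, 28.92)
print(as_min_sec(observed))  # 9:38

# Per-step speed needed to finish 20 steps within the expected 3-8 minutes.
for target_min in (3, 8):
    print(f"{target_min} min -> {target_min * 60 / 20:.1f} s/it")
```

At 20 steps, the 3-8 minute range corresponds to 9.0 to 24.0 s/it, so the observed 28.92 s/it is slower than even the upper end of that range.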

What browsers do you use to access the UI?

Mozilla Firefox

Sysinfo

sysinfo-2025-05-22-07-08.json

Console logs

Microsoft Windows [Version 10.0.26100.4061]
(c) Microsoft Corporation. All rights reserved.

C:\Users\thesa>cd C:\stable-diffusion-webui-directml

C:\stable-diffusion-webui-directml>.\venv\Scripts\activate

(venv) C:\stable-diffusion-webui-directml>pip uninstall torch torchvision torchaudio --yes
Found existing installation: torch 2.4.1
Uninstalling torch-2.4.1:
  Successfully uninstalled torch-2.4.1
Found existing installation: torchvision 0.19.1
Uninstalling torchvision-0.19.1:
  Successfully uninstalled torchvision-0.19.1
WARNING: Skipping torchaudio as it is not installed.

(venv) C:\stable-diffusion-webui-directml>pip uninstall torch-directml --yes
Found existing installation: torch-directml 0.2.5.dev240914
Uninstalling torch-directml-0.2.5.dev240914:
  Successfully uninstalled torch-directml-0.2.5.dev240914

(venv) C:\stable-diffusion-webui-directml>pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2 torch-directml
Looking in indexes: https://download.pytorch.org/whl/rocm5.4.2
ERROR: Could not find a version that satisfies the requirement torch (from versions: none)
ERROR: No matching distribution found for torch

(venv) C:\stable-diffusion-webui-directml>pip uninstall torch torchvision torchaudio --yes
WARNING: Skipping torch as it is not installed.
WARNING: Skipping torchvision as it is not installed.
WARNING: Skipping torchaudio as it is not installed.

(venv) C:\stable-diffusion-webui-directml>pip uninstall torch-directml --yes
WARNING: Skipping torch-directml as it is not installed.

(venv) C:\stable-diffusion-webui-directml>pip install torch-directml
Collecting torch-directml
  Using cached torch_directml-0.2.5.dev240914-cp310-cp310-win_amd64.whl.metadata (6.2 kB)
Collecting torch==2.4.1 (from torch-directml)
  Using cached torch-2.4.1-cp310-cp310-win_amd64.whl.metadata (27 kB)
Collecting torchvision==0.19.1 (from torch-directml)
  Using cached torchvision-0.19.1-cp310-cp310-win_amd64.whl.metadata (6.1 kB)
Requirement already satisfied: filelock in c:\stable-diffusion-webui-directml\venv\lib\site-packages (from torch==2.4.1->torch-directml) (3.18.0)
Requirement already satisfied: typing-extensions>=4.8.0 in c:\stable-diffusion-webui-directml\venv\lib\site-packages (from torch==2.4.1->torch-directml) (4.13.2)
Requirement already satisfied: sympy in c:\stable-diffusion-webui-directml\venv\lib\site-packages (from torch==2.4.1->torch-directml) (1.14.0)
Requirement already satisfied: networkx in c:\stable-diffusion-webui-directml\venv\lib\site-packages (from torch==2.4.1->torch-directml) (3.4.2)
Requirement already satisfied: jinja2 in c:\stable-diffusion-webui-directml\venv\lib\site-packages (from torch==2.4.1->torch-directml) (3.1.6)
Requirement already satisfied: fsspec in c:\stable-diffusion-webui-directml\venv\lib\site-packages (from torch==2.4.1->torch-directml) (2025.5.0)
Requirement already satisfied: numpy in c:\stable-diffusion-webui-directml\venv\lib\site-packages (from torchvision==0.19.1->torch-directml) (1.26.2)
Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in c:\stable-diffusion-webui-directml\venv\lib\site-packages (from torchvision==0.19.1->torch-directml) (9.5.0)
Requirement already satisfied: MarkupSafe>=2.0 in c:\stable-diffusion-webui-directml\venv\lib\site-packages (from jinja2->torch==2.4.1->torch-directml) (2.1.5)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in c:\stable-diffusion-webui-directml\venv\lib\site-packages (from sympy->torch==2.4.1->torch-directml) (1.3.0)
Using cached torch_directml-0.2.5.dev240914-cp310-cp310-win_amd64.whl (9.0 MB)
Using cached torch-2.4.1-cp310-cp310-win_amd64.whl (199.4 MB)
Using cached torchvision-0.19.1-cp310-cp310-win_amd64.whl (1.3 MB)
Installing collected packages: torch, torchvision, torch-directml
Successfully installed torch-2.4.1 torch-directml-0.2.5.dev240914 torchvision-0.19.1

(venv) C:\stable-diffusion-webui-directml>deactivate
C:\stable-diffusion-webui-directml>webui-user.bat
Already up to date.
venv "C:\stable-diffusion-webui-directml\venv\Scripts\Python.exe"
NVIDIA driver was found.
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Version: v1.10.1-amd-36-g679c645e
Commit hash: 679c645ec84e40dd14d527dbeb03fab259087187
WARNING: you should not skip torch test unless you want CPU to work.
C:\stable-diffusion-webui-directml\venv\lib\site-packages\onnxscript\converter.py:816: FutureWarning: 'onnxscript.values.Op.param_schemas' is deprecated in version 0.1 and will be removed in the future. Please use '.op_signature' instead.
  param_schemas = callee.param_schemas()
C:\stable-diffusion-webui-directml\venv\lib\site-packages\onnxscript\converter.py:816: FutureWarning: 'onnxscript.values.OnnxFunction.param_schemas' is deprecated in version 0.1 and will be removed in the future. Please use '.op_signature' instead.
  param_schemas = callee.param_schemas()
C:\stable-diffusion-webui-directml\venv\lib\site-packages\timm\models\layers\__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
  warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
C:\stable-diffusion-webui-directml\venv\lib\site-packages\pytorch_lightning\utilities\distributed.py:258: LightningDeprecationWarning: `pytorch_lightning.utilities.distributed.rank_zero_only` has been deprecated in v1.8.1 and will be removed in v2.0.0. You can import it from `pytorch_lightning.utilities` instead.
  rank_zero_deprecation(
Launching Web UI with arguments: --autolaunch --skip-torch-cuda-test --no-half --no-half-vae
Warning: caught exception 'Torch not compiled with CUDA enabled', memory monitor disabled
ONNX: version=1.22.0 provider=CUDAExecutionProvider, available=['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
Loading weights [31e35c80fc] from C:\stable-diffusion-webui-directml\models\Stable-diffusion\sd_xl_base_1.0.safetensors
Running on local URL:  http://127.0.0.1:7860
Creating model from config: C:\stable-diffusion-webui-directml\repositories\generative-models\configs\inference\sd_xl_base.yaml

To create a public link, set `share=True` in `launch()`.
Startup time: 12.7s (prepare environment: 19.7s, initialize shared: 0.8s, load scripts: 0.7s, create ui: 0.5s, gradio launch: 0.5s).
creating model quickly: OSError
Traceback (most recent call last):
  File "C:\stable-diffusion-webui-directml\venv\lib\site-packages\huggingface_hub\utils\_http.py", line 409, in hf_raise_for_status
    response.raise_for_status()
  File "C:\stable-diffusion-webui-directml\venv\lib\site-packages\requests\models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/None/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\stable-diffusion-webui-directml\venv\lib\site-packages\transformers\utils\hub.py", line 342, in cached_file
    resolved_file = hf_hub_download(
  File "C:\stable-diffusion-webui-directml\venv\lib\site-packages\huggingface_hub\utils\_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "C:\stable-diffusion-webui-directml\venv\lib\site-packages\huggingface_hub\file_download.py", line 1008, in hf_hub_download
    return _hf_hub_download_to_cache_dir(
  File "C:\stable-diffusion-webui-directml\venv\lib\site-packages\huggingface_hub\file_download.py", line 1115, in _hf_hub_download_to_cache_dir
    _raise_on_head_call_error(head_call_error, force_download, local_files_only)
  File "C:\stable-diffusion-webui-directml\venv\lib\site-packages\huggingface_hub\file_download.py", line 1643, in _raise_on_head_call_error
    raise head_call_error
  File "C:\stable-diffusion-webui-directml\venv\lib\site-packages\huggingface_hub\file_download.py", line 1531, in _get_metadata_or_catch_error
    metadata = get_hf_file_metadata(
  File "C:\stable-diffusion-webui-directml\venv\lib\site-packages\huggingface_hub\utils\_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "C:\stable-diffusion-webui-directml\venv\lib\site-packages\huggingface_hub\file_download.py", line 1448, in get_hf_file_metadata
    r = _request_wrapper(
  File "C:\stable-diffusion-webui-directml\venv\lib\site-packages\huggingface_hub\file_download.py", line 286, in _request_wrapper
    response = _request_wrapper(
  File "C:\stable-diffusion-webui-directml\venv\lib\site-packages\huggingface_hub\file_download.py", line 310, in _request_wrapper
    hf_raise_for_status(response)
  File "C:\stable-diffusion-webui-directml\venv\lib\site-packages\huggingface_hub\utils\_http.py", line 459, in hf_raise_for_status
    raise _format(RepositoryNotFoundError, message, response) from e
huggingface_hub.errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-682ec8dc-5d76afce6cadc6ad5a4fc169;6d81d0b7-e77f-4183-af74-72ba481f67f5)

Repository Not Found for url: https://huggingface.co/None/resolve/main/config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated. For more details, see https://huggingface.co/docs/huggingface_hub/authentication
Invalid username or password.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\thesa\AppData\Local\Programs\Python\Python310\lib\threading.py", line 973, in _bootstrap
    self._bootstrap_inner()
  File "C:\Users\thesa\AppData\Local\Programs\Python\Python310\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "C:\Users\thesa\AppData\Local\Programs\Python\Python310\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "C:\stable-diffusion-webui-directml\modules\initialize.py", line 149, in load_model
    shared.sd_model  # noqa: B018
  File "C:\stable-diffusion-webui-directml\modules\shared_items.py", line 190, in sd_model
    return modules.sd_models.model_data.get_sd_model()
  File "C:\stable-diffusion-webui-directml\modules\sd_models.py", line 693, in get_sd_model
    load_model()
  File "C:\stable-diffusion-webui-directml\modules\sd_models.py", line 831, in load_model
    sd_model = instantiate_from_config(sd_config.model, state_dict)
  File "C:\stable-diffusion-webui-directml\modules\sd_models.py", line 775, in instantiate_from_config
    return constructor(**params)
  File "C:\stable-diffusion-webui-directml\repositories\generative-models\sgm\models\diffusion.py", line 61, in __init__
    self.conditioner = instantiate_from_config(
  File "C:\stable-diffusion-webui-directml\repositories\generative-models\sgm\util.py", line 175, in instantiate_from_config
    return get_obj_from_str(config["target"])(**config.get("params", dict()))
  File "C:\stable-diffusion-webui-directml\repositories\generative-models\sgm\modules\encoders\modules.py", line 88, in __init__
    embedder = instantiate_from_config(embconfig)
  File "C:\stable-diffusion-webui-directml\repositories\generative-models\sgm\util.py", line 175, in instantiate_from_config
    return get_obj_from_str(config["target"])(**config.get("params", dict()))
  File "C:\stable-diffusion-webui-directml\repositories\generative-models\sgm\modules\encoders\modules.py", line 361, in __init__
    self.transformer = CLIPTextModel.from_pretrained(version)
  File "C:\stable-diffusion-webui-directml\modules\sd_disable_initialization.py", line 68, in CLIPTextModel_from_pretrained
    res = self.CLIPTextModel_from_pretrained(None, *model_args, config=pretrained_model_name_or_path, state_dict={}, **kwargs)
  File "C:\stable-diffusion-webui-directml\venv\lib\site-packages\transformers\modeling_utils.py", line 262, in _wrapper
    return func(*args, **kwargs)
  File "C:\stable-diffusion-webui-directml\venv\lib\site-packages\transformers\modeling_utils.py", line 3540, in from_pretrained
    resolved_config_file = cached_file(
  File "C:\stable-diffusion-webui-directml\venv\lib\site-packages\transformers\utils\hub.py", line 365, in cached_file
    raise EnvironmentError(
OSError: None is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`

Failed to create model quickly; will retry using slow method.
C:\stable-diffusion-webui-directml\modules\safe.py:156: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  return unsafe_torch_load(filename, *args, **kwargs)
Applying attention optimization: InvokeAI... done.
Model loaded in 20.4s (load weights from disk: 0.7s, create model: 10.3s, apply weights to model: 7.0s, apply float(): 1.8s, calculate empty prompt: 0.4s).
Loading VAE weights specified in settings: C:\stable-diffusion-webui-directml\models\VAE\fixFP16ErrorsSDXLLowerMemoryUse_v10.safetensors
Applying attention optimization: InvokeAI... done.
VAE weights loaded.
Calculating sha256 for C:\stable-diffusion-webui-directml\models\VAE\fixFP16ErrorsSDXLLowerMemoryUse_v10.safetensors: 235745af8d86bf4a4c1b5b4f529868b37019a10f7c0b2e79ad0abca3a22bc6e1
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [09:38<00:00, 28.92s/it]
Total progress: 100%|██████████████████████████████████████████████████████████████████| 20/20 [09:31<00:00, 28.55s/it]
Total progress: 100%|██████████████████████████████████████████████████████████████████| 20/20 [09:31<00:00, 29.34s/it]

Additional information

No response
