Add workflow for building NPU image #8546

Merged · 12 commits into hiyouga:main · Jul 4, 2025
Conversation

@wjunLu (Contributor) commented Jul 4, 2025

What does this PR do?

Add a workflow for building a multi-arch (x86_64 and aarch64) NPU image.

Issue

Partially fixes #8540
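
Such a workflow typically combines QEMU emulation with Buildx so that both architectures are built in one job. The fragment below is a hedged sketch using the standard docker/setup-qemu-action, docker/setup-buildx-action, and docker/build-push-action actions; the job name, step versions, and image tag are illustrative, not copied from the actual docker_npu.yml:

```yaml
# Hedged sketch of a multi-arch build job; names and tags are
# illustrative, not taken from the real docker_npu.yml.
jobs:
  build-npu:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-qemu-action@v3    # emulate the non-native architecture
      - uses: docker/setup-buildx-action@v3  # enable multi-platform builds
      - uses: docker/build-push-action@v6
        with:
          context: .
          file: docker/docker-npu/Dockerfile
          platforms: linux/amd64,linux/arm64
          push: false
          tags: llamafactory:npu
```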

@wjunLu (Contributor, Author) commented Jul 4, 2025

I have tested the docker_npu workflow on a forked repo, and it successfully runs every step before the build.

Because the runner does not have enough disk space, the build itself did not finish, but the logs show that the multi-arch build works: both the arm64 and amd64 images were building:

```
#15 20.32    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 19.5 MB/s eta 0:00:00
#15 ...
#14 [linux/amd64 5/8] RUN pip install --no-cache-dir -r requirements.txt
#14 21.87    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 571.0/571.0 MB 121.2 MB/s eta 0:00:00
#14 21.88 Downloading nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (200.2 MB)
#14 23.85 ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device
#14 23.85 
#14 23.86    ━━━━━━━━━━━━━━━━━━━━━━━━                 120.3/200.2 MB 60.9 MB/s eta 0:00:02
#14 ERROR: process "/bin/bash -c pip install --no-cache-dir -r requirements.txt" did not complete successfully: exit code: 1
#15 [linux/arm64 3/8] RUN pip config set global.index-url "https://pypi.org/simple" &&     pip config set global.extra-index-url "https://pypi.org/simple" &&     pip install --no-cache-dir --upgrade pip packaging wheel setuptools
#15 23.09 Installing collected packages: wheel, setuptools, pip, packaging
#15 23.72   Attempting uninstall: setuptools
#15 23.80     Found existing installation: setuptools 65.5.0
#15 24.16     Uninstalling setuptools-65.5.0:
#15 24.91       Successfully uninstalled setuptools-65.5.0
#15 CANCELED
------
 > [linux/amd64 5/8] RUN pip install --no-cache-dir -r requirements.txt:
15.51 Downloading nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB)
18.21 Downloading nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (897 kB)
18.22    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 897.7/897.7 kB 459.4 MB/s eta 0:00:00
18.23 Downloading nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB)
21.88 Downloading nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (200.2 MB)
23.85 ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device
23.85
```
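
The `[Errno 28] No space left on device` failure above is the familiar disk-space limit of GitHub-hosted runners, which leave only roughly 14 GB free. A common workaround is to delete preinstalled toolchains before the build; the step below is a hedged sketch assuming an ubuntu-latest runner (the paths are where that runner image keeps its .NET, Android, and GHC toolchains), not a step from the actual workflow:

```yaml
# Illustrative cleanup step for a GitHub-hosted ubuntu runner; the
# paths below hold large preinstalled toolchains on that image.
- name: Free disk space
  run: |
    sudo rm -rf /usr/share/dotnet /usr/local/lib/android /opt/ghc
    sudo docker image prune --all --force
    df -h /   # show remaining space
```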

@wjunLu wjunLu force-pushed the workflow branch 2 times, most recently from e20da02 to 448cecd Compare July 4, 2025 06:56
@wjunLu (Contributor, Author) commented Jul 4, 2025

Since docker/docker-npu/Dockerfile changed, I re-tested the NPU image built from it; the results are still OK.

[Screenshot: self-test result of the docker_npu.yml workflow]

  • Start a container with the new image quay.io/wjunlu27/llamafactory:0.9.4-npu-a2 (pulling from quay.io is faster than from docker.io):

```shell
docker run -it \
  -v $PWD/hf_cache:/root/.cache/huggingface \
  -v $PWD/ms_cache:/root/.cache/modelscope \
  -v $PWD/data:/app/data \
  -v $PWD/output:/app/output \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v $PWD/Test:/home/Test/ \
  -p 7861:7860 \
  -p 8010:8000 \
  --device /dev/davinci0 \
  --device /dev/davinci_manager \
  --device /dev/devmm_svm \
  --device /dev/hisi_hdc \
  --shm-size 16G \
  --name llamafactory \
  quay.io/wjunlu/llamafactory:0.9.4-npu-a2 bash
```
  • Install DeepSpeed and ModelScope, and set the environment variables:

```shell
pip install -e ".[deepspeed,modelscope]" -i https://pypi.tuna.tsinghua.edu.cn/simple
export ASCEND_RT_VISIBLE_DEVICES=0
export USE_MODELSCOPE_HUB=1
```
  • Check the llamafactory environment:

```shell
$ llamafactory-cli env

- `llamafactory` version: 0.9.4.dev0
- Platform: Linux-4.19.90-vhulk2211.3.0.h1804.eulerosv2r10.aarch64-aarch64-with-glibc2.35
- Python version: 3.11.12
- PyTorch version: 2.5.1 (NPU)
- Transformers version: 4.52.4
- Datasets version: 3.6.0
- Accelerate version: 1.7.0
- PEFT version: 0.15.2
- TRL version: 0.9.6
- NPU type: Ascend910B3
- CANN version: 8.1.RC1
- Default data directory: detected
```
  • Finetune:

```shell
torchrun \
    --nproc_per_node 1 \
    --nnodes 1 \
    --node_rank 0 \
    --master_addr 127.0.0.1 \
    --master_port 7007 \
    src/train.py /home/Test/qwen1_5_lora_sft_ds.yaml
```

The result is:

```
...
{'loss': 0.9117, 'grad_norm': 36778.3188305284, 'learning_rate': 3.727866032169127e-05, 'epoch': 1.87}
 63%|████████████████████████████████████████████████████████▋                                 | 928/1473 [09:43<05:37,  1.62it/s]
...
***** Running Evaluation *****
[INFO|trainer.py:4329] 2025-07-04 09:44:43,140 >>   Num examples = 110
[INFO|trainer.py:4332] 2025-07-04 09:44:43,140 >>   Batch size = 1
{'eval_loss': 0.9412215352058411, 'eval_runtime': 8.0786, 'eval_samples_per_second': 13.616, 'eval_steps_per_second': 13.616, 'epoch': 2.04}
 68%|████████████████████████████████████████████████████████████▍                            | 1000/1473 [10:35<04:45,  1.66it/s[INFO|trainer.py:3993] 2025-07-04 09:44:58,198 >> Saving model checkpoint to saves/Qwen1.5-7B/lora/sft/checkpoint-1000
...
[INFO|trainer.py:4332] 2025-07-04 09:50:05,443 >>   Batch size = 1
100%|███████████████████████████████████████████████████████████████████████████████████████████| 110/110 [00:07<00:00, 14.00it/s]
***** eval metrics *****
  epoch                   =        3.0
  eval_loss               =     0.9487
  eval_runtime            = 0:00:07.95
  eval_samples_per_second =     13.834
  eval_steps_per_second   =     13.834
[INFO|modelcard.py:450] 2025-07-04 09:50:13,393 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
```
  • Inference:

```shell
llamafactory-cli chat \
    --model_name_or_path qwen/Qwen1.5-7B \
    --adapter_name_or_path saves/Qwen1.5-7B/lora/sft \
    --template qwen \
    --finetuning_type lora
```

The result is:

```
[INFO|configuration_utils.py:1135] 2025-07-04 09:59:22,188 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "max_new_tokens": 2048
}

[INFO|2025-07-04 09:59:22] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
[INFO|2025-07-04 09:59:22] llamafactory.model.adapter:143 >> Merged 1 adapter(s).
[INFO|2025-07-04 09:59:22] llamafactory.model.adapter:143 >> Loaded adapter(s): saves/Qwen1.5-7B/lora/sft
[INFO|2025-07-04 09:59:22] llamafactory.model.loader:143 >> all params: 7,721,324,544
Welcome to the CLI application, use `clear` to remove the history, use `exit` to exit the application.
```

The chat transcript (translated from Chinese):

User: How can I relax when I feel mentally tense?
Assistant: A tense mental state can cause physical and psychological stress and affect daily life and work efficiency. Here are some ways to help you relax:

1. Deep breathing: relieve tension through deep breathing and let the body relax. Inhale slowly, then exhale slowly, repeating several times.

2. Progressive muscle relaxation: reduce physical tension by consciously relaxing the muscles. First tense a muscle, then release it, gradually relaxing the whole body.

3. Immersive experiences: for example, listening to music, watching movies, or reading; immersing yourself in a pleasant experience eases tension.

4. Exercise: exercise releases tension from the body, for example running, yoga, or walking.

5. Changing your way of thinking: for example, thinking positively or accepting reality can reduce tension.

6. Socializing: talking with friends or family and sharing your feelings can relieve tension.

You can choose among these methods according to your needs and preferences, but note that if the tension persists or interferes with daily life, it is advisable to consult a doctor or mental-health professional.

User:

@wjunLu wjunLu requested a review from hiyouga July 4, 2025 11:16
@hiyouga (Owner) left a comment:

LGTM

@hiyouga hiyouga merged commit d30cbcd into hiyouga:main Jul 4, 2025
17 checks passed
@hiyouga hiyouga added the solved This problem has been already solved label Jul 4, 2025
@wjunLu wjunLu deleted the workflow branch July 10, 2025 14:35