Open
Description
What happened?
I launched the TorchTune trainer with the Python SDK:
from kubeflow.trainer import *

client = TrainerClient()
client.train(
    runtime=client.get_runtime(name="torchtune-llama3.2-1b"),
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri="hf://tatsu-lab/alpaca/data",
            access_token="<MY_HF_TOKEN>",  # placeholder for my Hugging Face token
        ),
        model=HuggingFaceModelInitializer(
            storage_uri="hf://meta-llama/Llama-3.2-1B-Instruct",
            access_token="<MY_HF_TOKEN>",  # placeholder for my Hugging Face token
        ),
    ),
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            dataset_preprocess_config=TorchTuneInstructDataset(
                source=DataFormat.PARQUET,
            ),
            resources_per_node={
                "gpu": 1,
            },
        )
    ),
)
But the training job failed with the following error while importing bitsandbytes:
File "/opt/conda/lib/python3.11/site-packages/bitsandbytes/__init__.py", line 20, in <module>
from .nn import modules
File "/opt/conda/lib/python3.11/site-packages/bitsandbytes/nn/__init__.py", line 21, in <module>
from .triton_based_modules import (
File "/opt/conda/lib/python3.11/site-packages/bitsandbytes/nn/triton_based_modules.py", line 6, in <module>
from bitsandbytes.triton.dequantize_rowwise import dequantize_rowwise
File "/opt/conda/lib/python3.11/site-packages/bitsandbytes/triton/dequantize_rowwise.py", line 18, in <module>
@triton.autotune(
^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 378, in decorator
return Autotuner(fn, fn.arg_names, configs, key, reset_to_zero, restore_value, pre_hook=pre_hook,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 130, in __init__
self.do_bench = driver.active.get_benchmarker()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py", line 23, in __getattr__
self._initialize_obj()
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
self._obj = self._init_fn()
^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py", line 9, in _create_driver
return actives[0]()
^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 535, in __init__
self.utils = CudaUtils() # TODO: make static
^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 89, in __init__
mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 66, in compile_module_from_src
so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/build.py", line 18, in _build
raise RuntimeError("Failed to find C compiler. Please specify via CC environment variable.")
RuntimeError: Failed to find C compiler. Please specify via CC environment variable.
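To narrow this down, the trainer image can be checked for a usable C compiler. The snippet below is only a rough sketch of that check: it assumes Triton resolves the compiler from the `CC` environment variable first and then falls back to `gcc` or `clang` on `PATH`, which is my reading of the error message and has not been verified against the installed Triton version. The helper name `find_c_compiler` is made up for this illustration.

```python
import os
import shutil


def find_c_compiler(env=None):
    """Return a C compiler candidate, or None if nothing is found.

    Assumed lookup order (not confirmed against Triton's source for this
    version): the CC environment variable, then gcc, then clang on PATH.
    """
    env = os.environ if env is None else env
    cc = env.get("CC")
    if cc:
        return cc
    for candidate in ("gcc", "clang"):
        path = shutil.which(candidate)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    compiler = find_c_compiler()
    if compiler is None:
        print("No C compiler found: CC is unset and neither gcc nor clang is on PATH")
    else:
        print(f"C compiler candidate: {compiler}")
```

If this reports that no compiler is found when run inside the trainer container, the image most likely needs a compiler installed, or `CC` needs to point at one.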
/remove-label lifecycle/needs-triage
/area llm
What did you expect to happen?
The fine-tuning job should complete successfully.
Environment
Kubernetes version:
$ kubectl version
Client Version: v1.30.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.1
Kubeflow Trainer version:
$ kubectl get pods -n kubeflow-system -l app.kubernetes.io/name=kubeflow-trainer -o jsonpath="{.items[*].spec.containers[*].image}"
My self-built image in PR: https://github.com/kubeflow/trainer/pull/2832
Kubeflow Python SDK version:
$ pip show kubeflow
Name: kubeflow
Version: 0.1.0
Summary: Kubeflow Python SDK to manage ML workloads and to interact with Kubeflow APIs.
Home-page:
Author:
Author-email: The Kubeflow Authors <[email protected]>
License:
Location: /home/ws/miniconda3/envs/training-operator/lib/python3.11/site-packages
Editable project location: /home/ws/kubeflow/kubeflow-sdk
Requires: kubeflow-trainer-api, kubernetes, pydantic
Required-by:
Impacted by this bug?
Give it a 👍. We prioritize the issues with the most 👍.