Skip to content

Missing C Compiler in TorchTune Trainer #2884

@Electronic-Waste

Description

@Electronic-Waste

What happened?

I launched the torchtune trainer with Python SDK:

from kubeflow.trainer import *

client = TrainerClient()

client.train(
    runtime=client.get_runtime(name="torchtune-llama3.2-1b"),
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri="hf://tatsu-lab/alpaca/data",
            access_token=<MY_HF_TOKEN>,
        ),
        model=HuggingFaceModelInitializer(
            storage_uri="hf://meta-llama/Llama-3.2-1B-Instruct",
            access_token=<MY_HF_TOKEN>,
        )
    ),
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            dataset_preprocess_config=TorchTuneInstructDataset(
                source=DataFormat.PARQUET,
            ),
            resources_per_node={
                "gpu": 1,
            }
        )
    )
)

But got an error:

File "/opt/conda/lib/python3.11/site-packages/bitsandbytes/__init__.py", line 20, in <module>
    from .nn import modules
  File "/opt/conda/lib/python3.11/site-packages/bitsandbytes/nn/__init__.py", line 21, in <module>
    from .triton_based_modules import (
  File "/opt/conda/lib/python3.11/site-packages/bitsandbytes/nn/triton_based_modules.py", line 6, in <module>
    from bitsandbytes.triton.dequantize_rowwise import dequantize_rowwise
  File "/opt/conda/lib/python3.11/site-packages/bitsandbytes/triton/dequantize_rowwise.py", line 18, in <module>
    @triton.autotune(
     ^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 378, in decorator
    return Autotuner(fn, fn.arg_names, configs, key, reset_to_zero, restore_value, pre_hook=pre_hook,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 130, in __init__
    self.do_bench = driver.active.get_benchmarker()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py", line 23, in __getattr__
    self._initialize_obj()
  File "/opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
    self._obj = self._init_fn()
                ^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py", line 9, in _create_driver
    return actives[0]()
           ^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 535, in __init__
    self.utils = CudaUtils()  # TODO: make static
                 ^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 89, in __init__
    mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 66, in compile_module_from_src
    so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/triton/runtime/build.py", line 18, in _build
    raise RuntimeError("Failed to find C compiler. Please specify via CC environment variable.")
RuntimeError: Failed to find C compiler. Please specify via CC environment variable.

/remove-label lifecycle/needs-triage
/area llm

What did you expect to happen?

Successfully complete the fine-tuning process

Environment

Kubernetes version:

$ kubectl version
Client Version: v1.30.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.1

Kubeflow Trainer version:

$ kubectl get pods -n kubeflow-system -l app.kubernetes.io/name=kubeflow-trainer -o jsonpath="{.items[*].spec.containers[*].image}"
My self-built image in PR: https://github.com/kubeflow/trainer/pull/2832

Kubeflow Python SDK version:

$ pip show kubeflow
Name: kubeflow
Version: 0.1.0
Summary: Kubeflow Python SDK to manage ML workloads and to interact with Kubeflow APIs.
Home-page: 
Author: 
Author-email: The Kubeflow Authors <[email protected]>
License: 
Location: /home/ws/miniconda3/envs/training-operator/lib/python3.11/site-packages
Editable project location: /home/ws/kubeflow/kubeflow-sdk
Requires: kubeflow-trainer-api, kubernetes, pydantic
Required-by:

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions