
Benchmarking script always raises ValueError #529

@joelvdvoort

Description

I don't understand why my benchmarking attempts are failing.

Model used:

apiVersion: kubeai.org/v1
kind: Model
metadata:
  labels:
    argocd.argoproj.io/instance: kubeai-models-prod
    features.kubeai.org/TextGeneration: "true"
  name: llama-3.3-70b
  namespace: kubeai
spec:
  args:
  - --max-model-len=4092
  - --max-num-batched-tokens=8192
  - --gpu-memory-utilization=0.95
  - --enforce-eager
  - --disable-log-requests
  - --max-num-seqs=16
  - --quantization=bitsandbytes
  - --load-format=bitsandbytes
  engine: VLLM
  features:
  - TextGeneration
  loadBalancing:
    prefixHash:
      meanLoadFactor: 125
      prefixCharLength: 100
      replication: 256
    strategy: LeastLoad
  maxReplicas: 2
  minReplicas: 1
  owner: ""
  replicas: 1
  resourceProfile: nvidia-gpu-l40s-SHARED-large:1
  scaleDownDelaySeconds: 30
  targetRequests: 100
  url: hf://unsloth/Llama-3.3-70B-Instruct-bnb-4bit
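
For what it's worth, my assumption is that the benchmark's --model argument has to match a name the OpenAI-compatible endpoint actually serves (i.e. this Model's metadata.name, llama-3.3-70b). A quick way to see which names the gateway advertises, assuming KubeAI exposes the standard OpenAI model-listing route under /openai:

# list the models the kubeai service serves (run from inside the cluster)
curl -s http://kubeai/openai/v1/models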

The Job I used (sourced from your example repo):

apiVersion: batch/v1
kind: Job
metadata:
  name: benchmark-serving
spec:
  template:
    spec:
      containers:
        - name: benchmark-serving
          image: substratusai/benchmark_serving:v0.0.1
          args:
            - --base-url=http://kubeai/openai
            - --dataset-name=sharegpt
            - --dataset-path=/app/sharegpt_16_messages_or_more.json
            - --model=llama-3.1-8b-instruct-fp8-l4
            - --seed=12345
            - --tokenizer=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
            - --request-rate=200
            - --max-concurrency=1600
            - --num-prompts=8000
            - --max-conversations=800
      restartPolicy: Never
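
To take the script out of the equation, the initial single-prompt test it performs can be reproduced by hand with a plain completions request (a sketch; URL and model name copied from the args above and the parsed namespace in the logs below):

# same endpoint the benchmark hits: <base_url><endpoint> = http://kubeai/openai/v1/completions
curl -s http://kubeai/openai/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.1-8b-instruct-fp8-l4", "prompt": "Hello", "max_tokens": 5}'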

I get the following output:

k -n kubeai logs jobs/benchmark-serving benchmark-serving
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Namespace(backend='vllm', base_url='http://kubeai/openai', host='localhost', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='/app/sharegpt_16_messages_or_more.json', max_concurrency=800, model='llama-3.1-8b-instruct-fp8-l4', tokenizer='neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8', best_of=1, use_beam_search=False, num_prompts=8000, max_conversations=800, logprobs=None, request_rate=200.0, burstiness=1.0, seed=12345, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Traceback (most recent call last):
  File "/app/benchmark_serving.py", line 1317, in <module>
    main(args)
  File "/app/benchmark_serving.py", line 943, in main
    benchmark_result = asyncio.run(
  File "/usr/local/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/app/benchmark_serving.py", line 617, in benchmark
    raise ValueError(
ValueError: Initial test run failed - Please make sure benchmark arguments are correctly specified. Error: Not Found

I've already tried fiddling with different models in the job spec and with different tokenizer values; I always get the same ValueError in the output. Any help would be much appreciated!
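
For completeness, this is how I've been cross-checking which Model resources are actually deployed (their metadata.name is what I plug into --model):

kubectl get models.kubeai.org -n kubeai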
