[Bug] [llama2-7B] fail to execute Llama-2-7b-chat-hf-q4f16_1-MLC #1551

Closed
SunCrazy opened this issue Jan 7, 2024 · 45 comments
Labels: bug (Confirmed bugs)

Comments

@SunCrazy

SunCrazy commented Jan 7, 2024

🐛 Bug

When I execute Llama-2-7b-chat-hf-q4f16_1-MLC, some errors are raised as follows:

(relax-py10) [chenf@b11b0623:/mnt/ssd/chenf/project/mlc_models/example/llama2]$ /mnt/ssd/chenf/opensource/mlc-llm/build/mlc_chat_cli --model-lib-path prebuilt_libs/Llama-2-7b-chat-hf-q4f16_1-cuda.so --model Llama-2-7b-chat-hf-q4f16_1-MLC/
Use MLC config: "/mnt/ssd/chenf/project/mlc_models/example/llama2/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json"
Use model weights: "/mnt/ssd/chenf/project/mlc_models/example/llama2/Llama-2-7b-chat-hf-q4f16_1-MLC/ndarray-cache.json"
Use model library: "prebuilt_libs/Llama-2-7b-chat-hf-q4f16_1-cuda.so"
You can use the following special commands:
  /help               print the special commands
  /exit               quit the cli
  /stats              print out the latest stats (token/sec)
  /reset              restart a fresh chat
  /reload [model]  reload model `model` from disk, or reload the current model if `model` is not specified

Loading model...
Loading finished
Running system prompts...
[11:32:03] /mnt/ssd/chenf/opensource/mlc-llm/3rdparty/tvm/src/runtime/relax_vm/builtin.cc:310: Check failed: (static_cast<int64_t>(ptr->size()) == size) is false: ValueError: ErrorContext(fn=prefill, loc=param[3], param=params, annotation=R.Tuple(R.Tensor((v, 512), dtype="uint32"), R.Tensor((v, 128), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), 
R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 
512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), 
dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((12288, 512), dtype="uint32"), R.Tensor((12288, 128), dtype="float16"), R.Tensor((4096, 512), dtype="uint32"), R.Tensor((4096, 128), dtype="float16"), R.Tensor((22016, 512), dtype="uint32"), R.Tensor((22016, 128), dtype="float16"), R.Tensor((4096, 1376), dtype="uint32"), R.Tensor((4096, 344), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((4096,), dtype="float16"), R.Tensor((v, 512), dtype="uint32"), R.Tensor((v, 128), dtype="float16"), R.Tensor((cache_len, 128), dtype="float16"), R.Tensor((cache_len, 128), dtype="float16")))  expect a Tuple with 327 elements,  but get a Tuple with 0 elements.
Stack trace:
  [bt] (0) /mnt/ssd/chenf/opensource/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::Backtrace[abi:cxx11]()+0x2c) [0x7f58a951c3ac]
  [bt] (1) /mnt/ssd/chenf/opensource/mlc-llm/build/mlc_chat_cli(tvm::runtime::detail::LogFatal::Entry::Finalize()+0x3d) [0x56421e368aad]
  [bt] (2) /mnt/ssd/chenf/opensource/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::CheckTupleInfo(tvm::runtime::ObjectRef, long, tvm::runtime::Optional<tvm::runtime::String>)+0x29f) [0x7f58a957515f]
  [bt] (3) /mnt/ssd/chenf/opensource/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (tvm::runtime::ObjectRef, long, tvm::runtime::Optional<tvm::runtime::String>)>::AssignTypedLambda<void (*)(tvm::runtime::ObjectRef, long, tvm::runtime::Optional<tvm::runtime::String>)>(void (*)(tvm::runtime::ObjectRef, long, tvm::runtime::Optional<tvm::runtime::String>), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x216) [0x7f58a9589c96]
  [bt] (4) /mnt/ssd/chenf/opensource/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x7d) [0x7f58a95d271d]
  [bt] (5) /mnt/ssd/chenf/opensource/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)+0x90b) [0x7f58a95d471b]
  [bt] (6) /mnt/ssd/chenf/opensource/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()+0x235) [0x7f58a95d3ba5]
  [bt] (7) /mnt/ssd/chenf/opensource/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)+0x2a4) [0x7f58a95d56b4]
  [bt] (8) /mnt/ssd/chenf/opensource/mlc-llm/build/tvm/libtvm_runtime.so(+0x218cba) [0x7f58a95d5cba]

To Reproduce

Steps to reproduce the behavior:

  1. Compile the MLC runtime from source (using the TVM submodule); mlc-llm commit: 5e23900
  2. Download the model and prebuilt libs from the provided URL.
  3. Execute the command: /mnt/ssd/chenf/opensource/mlc-llm/build/mlc_chat_cli --model-lib-path prebuilt_libs/Llama-2-7b-chat-hf-q4f16_1-cuda.so --model Llama-2-7b-chat-hf-q4f16_1-MLC/

Environment

  • Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): CUDA
  • Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 20.04
  • Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): Tesla V100
  • How you installed MLC-LLM (conda, source): source
  • How you installed TVM-Unity (pip, source): source
  • Python version (e.g. 3.10): 3.10
  • GPU driver version (if applicable):
  • CUDA/cuDNN version (if applicable):
  • TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
USE_NVTX: OFF
USE_GTEST: AUTO
SUMMARIZE: OFF
USE_IOS_RPC: OFF
USE_MSC: OFF
USE_ETHOSU: OFF
CUDA_VERSION: 11.6
USE_LIBBACKTRACE: AUTO
DLPACK_PATH: 3rdparty/dlpack/include
USE_TENSORRT_CODEGEN: OFF
USE_THRUST: OFF
USE_TARGET_ONNX: OFF
USE_AOT_EXECUTOR: ON
BUILD_DUMMY_LIBTVM: OFF
USE_CUDNN: OFF
USE_TENSORRT_RUNTIME: OFF
USE_ARM_COMPUTE_LIB_GRAPH_EXECUTOR: OFF
USE_CCACHE: AUTO
USE_ARM_COMPUTE_LIB: OFF
USE_CPP_RTVM: OFF
USE_OPENCL_GTEST: /path/to/opencl/gtest
USE_MKL: OFF
USE_PT_TVMDSOOP: OFF
MLIR_VERSION: NOT-FOUND
USE_CLML: OFF
USE_STACKVM_RUNTIME: OFF
USE_GRAPH_EXECUTOR_CUDA_GRAPH: OFF
ROCM_PATH: /opt/rocm
USE_DNNL: OFF
USE_VITIS_AI: OFF
USE_MLIR: OFF
USE_RCCL: OFF
USE_LLVM: /mnt/ssd/chenf/software/miniconda3/envs/relax-py10/bin/llvm-config
USE_VERILATOR: OFF
USE_TF_TVMDSOOP: OFF
USE_THREADS: ON
USE_MSVC_MT: OFF
BACKTRACE_ON_SEGFAULT: OFF
USE_GRAPH_EXECUTOR: ON
USE_NCCL: OFF
USE_ROCBLAS: OFF
GIT_COMMIT_HASH: 7dfc863df8b6c9227a03547e5a0bf23f44c3f62d
USE_VULKAN: OFF
USE_RUST_EXT: OFF
USE_CUTLASS: OFF
USE_CPP_RPC: OFF
USE_HEXAGON: OFF
USE_CUSTOM_LOGGING: OFF
USE_UMA: OFF
USE_FALLBACK_STL_MAP: OFF
USE_SORT: ON
USE_RTTI: ON
GIT_COMMIT_TIME: 2024-01-04 15:14:07 +0800
USE_HEXAGON_SDK: /path/to/sdk
USE_BLAS: none
USE_ETHOSN: OFF
USE_LIBTORCH: OFF
USE_RANDOM: ON
USE_CUDA: ON
USE_COREML: OFF
USE_AMX: OFF
BUILD_STATIC_RUNTIME: OFF
USE_CMSISNN: OFF
USE_KHRONOS_SPIRV: OFF
USE_CLML_GRAPH_EXECUTOR: OFF
USE_TFLITE: OFF
USE_HEXAGON_GTEST: /path/to/hexagon/gtest
PICOJSON_PATH: 3rdparty/picojson
USE_OPENCL_ENABLE_HOST_PTR: OFF
INSTALL_DEV: OFF
USE_PROFILER: ON
USE_NNPACK: OFF
LLVM_VERSION: 17.0.6
USE_OPENCL: OFF
COMPILER_RT_PATH: 3rdparty/compiler-rt
RANG_PATH: 3rdparty/rang/include
USE_SPIRV_KHR_INTEGER_DOT_PRODUCT: OFF
USE_OPENMP: none
USE_BNNS: OFF
USE_CUBLAS: OFF
USE_METAL: OFF
USE_MICRO_STANDALONE_RUNTIME: OFF
USE_HEXAGON_EXTERNAL_LIBS: OFF
USE_ALTERNATIVE_LINKER: AUTO
USE_BYODT_POSIT: OFF
USE_HEXAGON_RPC: OFF
USE_MICRO: OFF
DMLC_PATH: 3rdparty/dmlc-core/include
INDEX_DEFAULT_I64: ON
USE_RELAY_DEBUG: OFF
USE_RPC: ON
USE_TENSORFLOW_PATH: none
TVM_CLML_VERSION:
USE_MIOPEN: OFF
USE_ROCM: OFF
USE_PAPI: OFF
USE_CURAND: OFF
TVM_CXX_COMPILER_PATH: /usr/bin/c++
HIDE_PRIVATE_SYMBOLS: ON
  • Any other relevant information:
SunCrazy added the bug (Confirmed bugs) label on Jan 7, 2024
@junrushao
Member

The C++-based CLI is less maintained, and we could instead use Python APIs.

Could you instead do:

rm -rf prebuilt_libs/

And then use the Python script below:

from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging

logging.enable_logging()

MODEL = "/mnt/ssd/chenf/project/mlc_models/example/llama2/Llama-2-7b-chat-hf-q4f16_1-MLC/"

def main():
    cm = ChatModule(
        MODEL,
        chat_config=ChatConfig(context_window_size=1024),
    )
    cm.generate(
        "What is the meaning of life?",
        progress_callback=callback.StreamToStdout(callback_interval=2),
    )

if __name__ == "__main__":
    main()

@SunCrazy
Author

SunCrazy commented Jan 7, 2024

The C++-based CLI is less maintained, and we could instead use Python APIs.

Could you instead do:

rm -rf prebuilt_libs/

And then use the Python script below:

from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging

logging.enable_logging()

MODEL = "/mnt/ssd/chenf/project/mlc_models/example/llama2/Llama-2-7b-chat-hf-q4f16_1-MLC/"

def main():
    cm = ChatModule(
        MODEL,
        chat_config=ChatConfig(context_window_size=1024),
    )
    cm.generate(
        "What is the meaning of life?",
        progress_callback=callback.StreamToStdout(callback_interval=2),
    )

if __name__ == "__main__":
    main()

@junrushao Thanks for your reply! I switched to running the Python script above, but got new errors:

I guess I missed building a dependency when compiling TVM, right?

[2024-01-07 17:08:31] INFO auto_device.py:76: Found device: cuda:0
[2024-01-07 17:08:32] INFO auto_device.py:85: Not found device: rocm:0
[2024-01-07 17:08:32] INFO auto_device.py:85: Not found device: metal:0
[2024-01-07 17:08:32] INFO auto_device.py:85: Not found device: vulkan:0
[2024-01-07 17:08:33] INFO auto_device.py:85: Not found device: opencl:0
[2024-01-07 17:08:33] INFO auto_device.py:33: Using device: cuda:0
[2024-01-07 17:08:33] INFO chat_module.py:366: Using model folder: /mnt/ssd/chenf/project/mlc_models/example/llama2/Llama-2-7b-chat-hf-q4f16_1-MLC
[2024-01-07 17:08:33] INFO chat_module.py:367: Using mlc chat config: /mnt/ssd/chenf/project/mlc_models/example/llama2/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json
[2024-01-07 17:08:33] INFO chat_module.py:756: Model lib not found. Now compiling model lib on device...
[2024-01-07 17:08:33] INFO llama_model.py:79: Overriding prefill_chunk_size from 4096 to 1024 (context_window_size)
[2024-01-07 17:08:33] INFO jit.py:83: Compiling using commands below:
[2024-01-07 17:08:33] INFO jit.py:84: /mnt/ssd/chenf/software/miniconda3/envs/relax-py10/bin/python -m mlc_chat compile /mnt/ssd/chenf/project/mlc_models/example/llama2/Llama-2-7b-chat-hf-q4f16_1-MLC --opt 'flashinfer=1;cublas_gemm=1;cudagraph=0' --overrides 'context_window_size=1024;prefill_chunk_size=4096;tensor_parallel_shards=1' --device cuda:0 --output /tmp/tmpqjedea0v/lib.so
[2024-01-07 17:08:34] INFO auto_config.py:69: Found model configuration: /mnt/ssd/chenf/project/mlc_models/example/llama2/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json
[2024-01-07 17:08:34] INFO auto_target.py:75: Detecting target device: cuda:0
[2024-01-07 17:08:34] INFO auto_target.py:77: Found target: {"thread_warp_size": 32, "arch": "sm_70", "max_threads_per_block": 1024, "max_num_threads": 1024, "kind": "cuda", "max_shared_memory_per_block": 49152, "tag": "", "keys": ["cuda", "gpu"]}
[2024-01-07 17:08:34] INFO auto_target.py:94: Found host LLVM triple: x86_64-conda-linux-gnu
[2024-01-07 17:08:34] INFO auto_target.py:95: Found host LLVM CPU: skylake-avx512
[2024-01-07 17:08:34] INFO auto_target.py:242: Generating code for CUDA architecture: sm_70
[2024-01-07 17:08:34] INFO auto_target.py:243: To produce multi-arch fatbin, set environment variable MLC_MULTI_ARCH. Example: MLC_MULTI_ARCH=70,72,75,80,86,87,89,90
[2024-01-07 17:08:34] INFO auto_config.py:151: Found model type: llama. Use `--model-type` to override.
[2024-01-07 17:08:34] WARNING compiler_flags.py:67: flashinfer is not supported on CUDA arch < 80
Compiling with arguments:
  --config          LlamaConfig(hidden_size=4096, intermediate_size=11008, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-06, vocab_size=32000, position_embedding_base=10000, context_window_size=4096, prefill_chunk_size=4096, num_key_value_heads=32, head_dim=128, tensor_parallel_shards=1, max_batch_size=1, kwargs={})
  --quantization    GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7)
  --model-type      llama
  --target          {"thread_warp_size": 32, "host": {"mtriple": "x86_64-conda-linux-gnu", "tag": "", "kind": "llvm", "mcpu": "skylake-avx512", "keys": ["cpu"]}, "arch": "sm_70", "max_threads_per_block": 1024, "max_num_threads": 1024, "kind": "cuda", "max_shared_memory_per_block": 49152, "tag": "", "keys": ["cuda", "gpu"]}
  --opt             flashinfer=0;cublas_gemm=0;cudagraph=0
  --system-lib-prefix ""
  --output          /tmp/tmpqjedea0v/lib.so
  --overrides       context_window_size=1024;sliding_window_size=None;prefill_chunk_size=4096;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=1
[2024-01-07 17:08:34] INFO compiler_flags.py:118: Overriding context_window_size from 4096 to 1024
[2024-01-07 17:08:34] INFO compiler_flags.py:118: Overriding prefill_chunk_size from 4096 to 4096
[2024-01-07 17:08:34] INFO compiler_flags.py:118: Overriding tensor_parallel_shards from 1 to 1
[2024-01-07 17:08:34] INFO llama_model.py:79: Overriding prefill_chunk_size from 4096 to 1024 (context_window_size)
[2024-01-07 17:08:34] INFO compile.py:131: Creating model from: LlamaConfig(hidden_size=4096, intermediate_size=11008, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-06, vocab_size=32000, position_embedding_base=10000, context_window_size=4096, prefill_chunk_size=4096, num_key_value_heads=32, head_dim=128, tensor_parallel_shards=1, max_batch_size=1, kwargs={})
[2024-01-07 17:08:34] INFO compile.py:141: Exporting the model to TVM Unity compiler
[2024-01-07 17:08:39] INFO compile.py:147: Running optimizations using TVM Unity
[2024-01-07 17:08:39] INFO compile.py:160: Registering metadata: {'model_type': 'llama', 'quantization': 'q4f16_1', 'context_window_size': 1024, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 1024, 'tensor_parallel_shards': 1, 'kv_cache_bytes': 536870912}
[2024-01-07 17:08:39] INFO pipeline.py:35: Running TVM Relax graph-level optimizations
[2024-01-07 17:08:42] INFO pipeline.py:35: Lowering to TVM TIR kernels
[2024-01-07 17:08:48] INFO pipeline.py:35: Running TVM TIR-level optimizations
[2024-01-07 17:08:57] INFO pipeline.py:35: Running TVM Dlight low-level optimizations
[2024-01-07 17:09:09] INFO pipeline.py:35: Lowering to VM bytecode
[2024-01-07 17:09:10] INFO estimate_memory_usage.py:55: [Memory usage] Function `_initialize_effect`: 0.00 MB
[2024-01-07 17:09:10] INFO estimate_memory_usage.py:55: [Memory usage] Function `batch_decode`: 0.09 MB
[2024-01-07 17:09:10] INFO estimate_memory_usage.py:55: [Memory usage] Function `batch_prefill`: 88.51 MB
[2024-01-07 17:09:10] INFO estimate_memory_usage.py:55: [Memory usage] Function `create_flashinfer_paged_kv_cache`: 0.00 MB
[2024-01-07 17:09:10] INFO estimate_memory_usage.py:55: [Memory usage] Function `decode`: 8.36 MB
[2024-01-07 17:09:10] INFO estimate_memory_usage.py:55: [Memory usage] Function `embed`: 8.00 MB
[2024-01-07 17:09:10] INFO estimate_memory_usage.py:55: [Memory usage] Function `prefill`: 245.63 MB
[2024-01-07 17:09:10] INFO estimate_memory_usage.py:55: [Memory usage] Function `softmax_with_temperature`: 0.12 MB
[2024-01-07 17:09:11] INFO pipeline.py:35: Compiling external modules
[2024-01-07 17:09:11] INFO pipeline.py:35: Compilation complete! Exporting to disk
[2024-01-07 17:09:22] INFO compile.py:175: Generated: /tmp/tmpqjedea0v/lib.so
[2024-01-07 17:09:22] INFO jit.py:87: Using compiled model lib: /iothome/chenf/.cache/mlc_chat/model_lib/beca1ee3070ef1b4cd8bbddad9a1c09d.so
[2024-01-07 17:09:23] ERROR model_metadata.py:93: FAILED to read metadata section in legacy model lib.
Traceback (most recent call last):
  File "/mnt/ssd/chenf/opensource/mlc-llm/python/mlc_chat/cli/model_metadata.py", line 91, in main
    metadata = _extract_metadata(parsed.model_lib)
  File "/mnt/ssd/chenf/opensource/mlc-llm/python/mlc_chat/cli/model_metadata.py", line 24, in _extract_metadata
    return json.loads(VirtualMachine(load_module(model_lib), device("cpu"))["_metadata"]())
  File "/mnt/ssd/chenf/opensource/tvm/python/tvm/runtime/relax_vm.py", line 97, in __init__
    self._setup_device(device, memory_cfg)
  File "/mnt/ssd/chenf/opensource/tvm/python/tvm/runtime/relax_vm.py", line 133, in _setup_device
    self.module["vm_initialization"](*init_args)
  File "/mnt/ssd/chenf/opensource/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
    raise_last_ffi_error()
  File "/mnt/ssd/chenf/opensource/tvm/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "/mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc", line 839, in tvm::runtime::relax_vm::VirtualMachineImpl::_Init(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
    this->Init(devices, alloc_types);
  File "/mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc", line 462, in tvm::runtime::relax_vm::VirtualMachineImpl::Init(std::vector<DLDevice, std::allocator<DLDevice> > const&, std::vector<tvm::runtime::memory::AllocatorType, std::allocator<tvm::runtime::memory::AllocatorType> > const&)
    this->InitFuncPool();
  File "/mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc", line 676, in tvm::runtime::relax_vm::VirtualMachineImpl::InitFuncPool()
    ICHECK(func.defined())
tvm.error.InternalError: Traceback (most recent call last):
  2: tvm::runtime::relax_vm::VirtualMachineImpl::_Init(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
        at /mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc:839
  1: tvm::runtime::relax_vm::VirtualMachineImpl::Init(std::vector<DLDevice, std::allocator<DLDevice> > const&, std::vector<tvm::runtime::memory::AllocatorType, std::allocator<tvm::runtime::memory::AllocatorType> > const&)
        at /mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc:462
  0: tvm::runtime::relax_vm::VirtualMachineImpl::InitFuncPool()
        at /mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc:676
  File "/mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc", line 676
InternalError: Check failed: (func.defined()) is false: Error: Cannot find PackedFunc paged_kv_cache.attention_kernel_prefill in either Relax VM kernel library, or in TVM runtime PackedFunc registry, or in global Relax functions of the VM executable
Traceback (most recent call last):
  File "/mnt/ssd/chenf/project/mlc_models/example/llama2/run.py", line 19, in <module>
    main()
  File "/mnt/ssd/chenf/project/mlc_models/example/llama2/run.py", line 9, in main
    cm = ChatModule(
  File "/mnt/ssd/chenf/opensource/mlc-llm/python/mlc_chat/chat_module.py", line 774, in __init__
    self._reload(self.model_lib_path, self.model_path, user_chat_config_json_str)
  File "/mnt/ssd/chenf/opensource/mlc-llm/python/mlc_chat/chat_module.py", line 988, in _reload
    self._reload_func(lib, model_path, app_config_json)
  File "/mnt/ssd/chenf/opensource/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
    raise_last_ffi_error()
  File "/mnt/ssd/chenf/opensource/tvm/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "/mnt/ssd/chenf/opensource/mlc-llm/cpp/llm_chat.cc", line 1541, in mlc::llm::LLMChatModule::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
    chat_->Reload(args[0], args[1], args[2]);
  File "/mnt/ssd/chenf/opensource/mlc-llm/cpp/llm_chat.cc", line 557, in mlc::llm::LLMChat::Reload(tvm::runtime::TVMArgValue, tvm::runtime::String, tvm::runtime::String)
    this->ft_.Init(reload_lib, device_, this->num_shards_);
  File "/mnt/ssd/chenf/opensource/mlc-llm/cpp/llm_chat.cc", line 160, in Init
    this->local_vm->GetFunction("vm_initialization")(
  File "/mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc", line 839, in tvm::runtime::relax_vm::VirtualMachineImpl::_Init(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
    this->Init(devices, alloc_types);
  File "/mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc", line 462, in tvm::runtime::relax_vm::VirtualMachineImpl::Init(std::vector<DLDevice, std::allocator<DLDevice> > const&, std::vector<tvm::runtime::memory::AllocatorType, std::allocator<tvm::runtime::memory::AllocatorType> > const&)
    this->InitFuncPool();
  File "/mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc", line 676, in tvm::runtime::relax_vm::VirtualMachineImpl::InitFuncPool()
    ICHECK(func.defined())
tvm.error.InternalError: Traceback (most recent call last):
  5: mlc::llm::LLMChatModule::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
        at /mnt/ssd/chenf/opensource/mlc-llm/cpp/llm_chat.cc:1541
  4: mlc::llm::LLMChat::Reload(tvm::runtime::TVMArgValue, tvm::runtime::String, tvm::runtime::String)
        at /mnt/ssd/chenf/opensource/mlc-llm/cpp/llm_chat.cc:557
  3: Init
        at /mnt/ssd/chenf/opensource/mlc-llm/cpp/llm_chat.cc:160
  2: tvm::runtime::relax_vm::VirtualMachineImpl::_Init(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
        at /mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc:839
  1: tvm::runtime::relax_vm::VirtualMachineImpl::Init(std::vector<DLDevice, std::allocator<DLDevice> > const&, std::vector<tvm::runtime::memory::AllocatorType, std::allocator<tvm::runtime::memory::AllocatorType> > const&)
        at /mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc:462
  0: tvm::runtime::relax_vm::VirtualMachineImpl::InitFuncPool()
        at /mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc:676
  File "/mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc", line 676
InternalError: Check failed: (func.defined()) is false: Error: Cannot find PackedFunc paged_kv_cache.attention_kernel_prefill in either Relax VM kernel library, or in TVM runtime PackedFunc registry, or in global Relax functions of the VM executable

@MasterJH5574
Member

Thanks @SunCrazy for sharing the stack trace. Figured out the reason and am gonna work on a fix. Will report back after fixing.

@junrushao
Member

Potential fix: #1555

@BaeBae33
Contributor

BaeBae33 commented Jan 9, 2024

Potential fix: #1555

Still crashes.

@junrushao
Member

The ROCm backend is broken, and I'm able to confirm it with the single-line command below:

mlc_chat compile llama2_7b --opt O0 --device rocm:0 --output /tmp/lib.so --quantization q4f16_1

I submitted one of the fixes, which handles MoE operators (group GEMM) properly in our LLVM-ROCm backend: apache/tvm#16403.

Note that it is still broken with failed LLVM verification, and we need a few extra changes to get it back:

  File "/home/junrushao/Projects/tvm-dev/src/target/llvm/codegen_llvm.cc", line 354
TVMError: LLVM module verification failed with the following errors:
Intrinsic has incorrect argument type!
ptr @llvm.amdgcn.ds.bpermute
Intrinsic has incorrect argument type!
ptr @llvm.amdgcn.ds.bpermute
Intrinsic has incorrect argument type!
ptr @llvm.amdgcn.ds.bpermute
Intrinsic has incorrect argument type!
ptr @llvm.amdgcn.ds.bpermute
Intrinsic has incorrect argument type!
ptr @llvm.amdgcn.ds.bpermute
Intrinsic has incorrect argument type!
ptr @llvm.amdgcn.ds.bpermute
Call parameter type does not match function signature!
float 1.000000e+04
 <4 x float>  %46 = call <4 x float> @__ocml_pow_f32(float 1.000000e+04, float %45)
Both operands to a binary operator are not of the same type!
  %47 = fdiv float %41, <4 x float> %46
Call parameter type does not match function signature!
  %47 = fdiv float %41, <4 x float> %46
 <4 x float>  %48 = call <4 x float> @__ocml_sin_f32(float %47)
fptrunc source and destination must both be a vector or neither
  %49 = fptrunc <4 x float> %48 to half
Call parameter type does not match function signature!
float 1.000000e+04
 <4 x float>  %283 = call <4 x float> @__ocml_pow_f32(float 1.000000e+04, float %282)
Both operands to a binary operator are not of the same type!
  %284 = fdiv float %278, <4 x float> %283
Call parameter type does not match function signature!
  %284 = fdiv float %278, <4 x float> %283
 <4 x float>  %285 = call <4 x float> @__ocml_cos_f32(float %284)
fptrunc source and destination must both be a vector or neither
  %286 = fptrunc <4 x float> %285 to half
Call parameter type does not match function signature!
float 1.000000e+04
 <4 x float>  %391 = call <4 x float> @__ocml_pow_f32(float 1.000000e+04, float %390)
Both operands to a binary operator are not of the same type!
  %392 = fdiv float %386, <4 x float> %391
Call parameter type does not match function signature!
  %392 = fdiv float %386, <4 x float> %391
 <4 x float>  %393 = call <4 x float> @__ocml_sin_f32(float %392)
fptrunc source and destination must both be a vector or neither
  %394 = fptrunc <4 x float> %393 to half
Call parameter type does not match function signature!
float 1.000000e+04
 <4 x float>  %748 = call <4 x float> @__ocml_pow_f32(float 1.000000e+04, float %747)
Both operands to a binary operator are not of the same type!
  %749 = fdiv float %743, <4 x float> %748
Call parameter type does not match function signature!
  %749 = fdiv float %743, <4 x float> %748
 <4 x float>  %750 = call <4 x float> @__ocml_cos_f32(float %749)
fptrunc source and destination must both be a vector or neither
  %751 = fptrunc <4 x float> %750 to half

@junrushao
Member

CC: @tqchen @spectrometerHBH

@zhongzhenluo

Hi, I got the same error with the Vulkan backend.

@junrushao
Member

Right. Vulkan and AMD share the same LLVM layer, and thus they are broken at the same time.

@junrushao
Member

junrushao commented Jan 15, 2024

Confirmed that this PR fixes the ROCm issue, while @spectrometerHBH is working on a fix for Vulkan right now. I'm cherry-picking this change to mlc-ai/relax so that it goes into our nightlies starting tonight.

@junrushao
Member

Starting with the next nightly (tomorrow), the following command should work again on ROCm:

mlc_chat chat HF://junrushao/Llama-2-7b-chat-hf-q4f16_1

@zhongzhenluo

Thanks. I will be looking forward to the Vulkan update.

@spectrometerHBH
Member

Thanks. I will be looking forward to the Vulkan update.

Hi, some Vulkan fixes are here: apache/tvm#16405. Let us know if it works for you.

@junrushao
Member

@zhongzhenluo Vulkan target is fixed by: mlc-ai/relax@9c8cf08

@TroyTzou

TroyTzou commented Jan 18, 2024

Hi, I get the same error, but when loading MiniChat-3B-q3f16_1. I re-synced the GitHub code and the result is the same.
Error message:

Python 3.10.13 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:24:38) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.

from mlc_chat import ChatModule
cm = ChatModule(model="./dist/MiniChat-2-3B-q3f16_1-MLC",
----model_lib_path="./dist/libs/MiniChat-2-3B-q3f16_1-vulkan.dll",
----device="vulkan")

[17:23:11] D:\a\package\package\tvm\src\runtime\relax_vm\vm.cc:676: InternalError: Check failed: (func.defined()) is false: Error: Cannot find PackedFunc paged_kv_cache.attention_kernel_prefill in either Relax VM kernel library, or in TVM runtime PackedFunc registry, or in global Relax functions of the VM executable
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[2024-01-18 17:23:11] ERROR model_metadata.py:93: FAILED to read metadata section in legacy model lib.
Traceback (most recent call last):
File "D:\Program\Anaconda3\envs\mlc-llm-2401\lib\site-packages\mlc_chat\cli\model_metadata.py", line 91, in main
metadata = _extract_metadata(parsed.model_lib)
File "D:\Program\Anaconda3\envs\mlc-llm-2401\lib\site-packages\mlc_chat\cli\model_metadata.py", line 24, in _extract_metadata
return json.loads(VirtualMachine(load_module(model_lib), device("cpu"))["_metadata"]())
File "D:\Program\Anaconda3\envs\mlc-llm-2401\lib\site-packages\tvm\runtime\relax_vm.py", line 97, in __init__
self._setup_device(device, memory_cfg)
File "D:\Program\Anaconda3\envs\mlc-llm-2401\lib\site-packages\tvm\runtime\relax_vm.py", line 133, in _setup_device
self.module["vm_initialization"](*init_args)
File "D:\Program\Anaconda3\envs\mlc-llm-2401\lib\site-packages\tvm\_ffi\_ctypes\packed_func.py", line 239, in __call__
raise_last_ffi_error()
File "D:\Program\Anaconda3\envs\mlc-llm-2401\lib\site-packages\tvm\_ffi\base.py", line 481, in raise_last_ffi_error
raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
File "D:\a\package\package\tvm\src\runtime\relax_vm\vm.cc", line 676
InternalError: Check failed: (func.defined()) is false: Error: Cannot find PackedFunc paged_kv_cache.attention_kernel_prefill in either Relax VM kernel library, or in TVM runtime PackedFunc registry, or in global Relax functions of the VM executable
[17:23:11] D:\a\package\package\mlc-llm\cpp\tokenizers.cc:52: Warning: Using tokenizer.model since we cannot locate tokenizer.json.
It is recommended to use tokenizer.json to ensure all token mappings are included, since currently, files like added_tokens.json, tokenizer_config.json are ignored.
Consider converting tokenizer.model to tokenizer.json by compiling the model with MLC again, or see if MLC's huggingface provides this file.
[17:23:11] D:\a\package\package\tvm\src\runtime\relax_vm\vm.cc:676: InternalError: Check failed: (func.defined()) is false: Error: Cannot find PackedFunc paged_kv_cache.attention_kernel_prefill in either Relax VM kernel library, or in TVM runtime PackedFunc registry, or in global Relax functions of the VM executable
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

Traceback (most recent call last):
File "", line 1, in
File "D:\Program\Anaconda3\envs\mlc-llm-2401\lib\site-packages\mlc_chat\chat_module.py", line 774, in init
self._reload(self.model_lib_path, self.model_path, user_chat_config_json_str)
File "D:\Program\Anaconda3\envs\mlc-llm-2401\lib\site-packages\mlc_chat\chat_module.py", line 988, in _reload
self._reload_func(lib, model_path, app_config_json)
File "D:\Program\Anaconda3\envs\mlc-llm-2401\lib\site-packages\tvm_ffi_ctypes\packed_func.py", line 239, in call
raise_last_ffi_error()
File "D:\Program\Anaconda3\envs\mlc-llm-2401\lib\site-packages\tvm_ffi\base.py", line 481, in raise_last_ffi_error
raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
File "D:\a\package\package\tvm\src\runtime\relax_vm\vm.cc", line 676
InternalError: Check failed: (func.defined()) is false: Error: Cannot find PackedFunc paged_kv_cache.attention_kernel_prefill in either Relax VM kernel library, or in TVM runtime PackedFunc registry, or in global Relax functions of the VM executable

But there is a strange thing: when I load MiniChat-3B-q4f16_1, RedPajama-INCITE-Chat-3B-v1-q4f16_1, and RedPajama-INCITE-Chat-3B-v1-q3f16_1, everything is normal and there is no error.

@junrushao
Member

This is a bug introduced by recent changes. We updated the wheel yesterday to incorporate the fix. Could you re-install the latest wheel and try again? Thanks!
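For reference, re-installing the nightly typically looks like the command below (a sketch reusing the install pattern shown later in this thread; swap the -rocm57 suffix for the variant that matches your backend). Adding pip's --force-reinstall flag forces a reinstall even when pip believes the package is already up to date:

python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-ai-nightly-rocm57 mlc-chat-nightly-rocm57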

@hanq-moreh

Hi @junrushao, I'm having the same issue with Llama-2-7b-chat-hf-q4f16_1-MLC when I install the wheel with the following command.

python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-ai-nightly-rocm57 mlc-chat-nightly-rocm57

Can you tell me how to install the latest wheel?
It seems the recent changes are not reflected in the latest version at https://mlc.ai/wheels.

@junrushao
Member

Does the command below work:

mlc_chat chat HF://junrushao/Llama-2-7b-chat-hf-q4f16_1

@TroyTzou

After I updated the wheel, this is now the error:

from mlc_chat import ChatModule
cm = ChatModule(model="./dist/MiniChat-2-3B-q4f16_1-MLC", \
     model_lib_path="./dist/libs/MiniChat-2-3B-q4f16_1-vulkan.dll", \
    device="vulkan")

Traceback (most recent call last):
File "", line 1, in
File "D:\Program\Anaconda3\envs\mlc-llm-2401\lib\site-packages\mlc_chat\chat_module.py", line 744, in init
self.model_path, self.config_file_path = _get_model_path(model)
File "D:\Program\Anaconda3\envs\mlc-llm-2401\lib\site-packages\mlc_chat\chat_module.py", line 398, in _get_model_path
raise FileNotFoundError(
FileNotFoundError: Cannot find the model folder. We searched over the following possible paths:

I installed:

mlc_ai_nightly-0.12.dev2072-cp310-cp310-win_amd64.whl
mlc_chat_nightly-0.1.dev817-cp310-cp310-win_amd64.whl
@junrushao

@hanq-moreh

Does the command below work:

mlc_chat chat HF://junrushao/Llama-2-7b-chat-hf-q4f16_1

With the command above, I got this error message.
But I can still chat with Llama; I'm not sure it's working correctly.

[2024-01-19 07:37:43] ERROR model_metadata.py:51: FAILED: Encountered dynamic shape vocab_size, need to specify `--mlc-chat-config` for memory usage calculation.
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/py_3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/mlc_chat/cli/model_metadata.py", line 152, in <module>
    main()
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/mlc_chat/cli/model_metadata.py", line 146, in main
    _report_memory_usage(metadata, cfg)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/mlc_chat/cli/model_metadata.py", line 77, in _report_memory_usage
    param_shape = _read_dynamic_shape(param["shape"], config)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/mlc_chat/cli/model_metadata.py", line 57, in _read_dynamic_shape
    raise AttributeError
AttributeError

With my Python code using ChatModule, I got this error:

[2024-01-19 07:41:07] ERROR model_metadata.py:134: FAILED to read metadata section in legacy model lib.
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/mlc_chat/cli/model_metadata.py", line 132, in main
    metadata = _extract_metadata(parsed.model_lib)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/mlc_chat/cli/model_metadata.py", line 24, in _extract_metadata
    return json.loads(VirtualMachine(load_module(model_lib), device("cpu"))["_metadata"]())
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/tvm/runtime/relax_vm.py", line 136, in __getitem__
    return self.module[key]
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/tvm/runtime/module.py", line 192, in __getitem__
    return self.get_function(name)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/tvm/runtime/module.py", line 176, in get_function
    raise AttributeError(f"Module has no function '{name}'")
AttributeError: Module has no function '_metadata'

@sherelynyap

sherelynyap commented Jan 19, 2024

Does the command below work:

mlc_chat chat HF://junrushao/Llama-2-7b-chat-hf-q4f16_1

I got this (facing the same problem) using WSL + CUDA:

[2024-01-20 00:51:29] INFO pipeline.py:41: Compilation complete! Exporting to disk
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.11/site-packages/mlc_chat/interface/compile.py", line 174, in _compile
    args.build_func(
  File "/root/miniconda3/lib/python3.11/site-packages/mlc_chat/support/auto_target.py", line 215, in build
    relax.build(
  File "/root/miniconda3/lib/python3.11/site-packages/tvm/relax/vm_build.py", line 341, in build
    return _vmlink(
           ^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/tvm/relax/vm_build.py", line 247, in _vmlink
    lib = tvm.build(
          ^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/tvm/driver/build_module.py", line 294, in build
    rt_mod_host = _driver_ffi.tir_to_runtime(annotated_mods, target_host)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/root/miniconda3/lib/python3.11/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
  File "/root/miniconda3/lib/python3.11/site-packages/mlc_chat/support/auto_target.py", line 260, in tvm_callback_cuda_compile
    ptx = nvcc.compile_cuda(code, target_format="fatbin")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/tvm/contrib/nvcc.py", line 120, in compile_cuda
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/root/miniconda3/lib/python3.11/subprocess.py", line 1950, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'nvcc'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/root/miniconda3/lib/python3.11/site-packages/mlc_chat/__main__.py", line 47, in <module>
    main()
  File "/root/miniconda3/lib/python3.11/site-packages/mlc_chat/__main__.py", line 24, in main
    cli.main(sys.argv[2:])
  File "/root/miniconda3/lib/python3.11/site-packages/mlc_chat/cli/compile.py", line 131, in main
    compile(
  File "/root/miniconda3/lib/python3.11/site-packages/mlc_chat/interface/compile.py", line 224, in compile
    _compile(args, model_config)
  File "/root/miniconda3/lib/python3.11/site-packages/mlc_chat/interface/compile.py", line 128, in _compile
    with args.target:
  File "/root/miniconda3/lib/python3.11/site-packages/tvm/target/target.py", line 145, in __exit__
    _ffi_api.TargetExitScope(self)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/root/miniconda3/lib/python3.11/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm.error.InternalError: Traceback (most recent call last):
  2: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (tvm::Target)>::AssignTypedLambda<void (*)(tvm::Target)>(void (*)(tvm::Target), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  1: tvm::Target::ExitWithScope()
  0: _ZN3tvm7runtime6deta
  File "/workspace/tvm/src/target/target.cc", line 747
InternalError: Check failed: (entry->context_stack.top().same_as(*this)) is false:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.11/site-packages/mlc_chat/chat_module.py", line 751, in __init__
    self.model_lib_path = _get_lib_module_path(
                          ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/mlc_chat/chat_module.py", line 575, in _get_lib_module_path
    raise FileNotFoundError(err_msg)
FileNotFoundError: Cannot find the model library that corresponds to `None`.`None` is either provided in the `chat_config` you passed in, or specified in /root/.cache/mlc_chat/model_weights/junrushao/Llama-2-7b-chat-hf-q4f16_1/mlc-chat-config.json.
We searched over the following possible paths:
- None-cuda.so
- dist/prebuilt/lib/None-cuda.so
- dist/HF://junrushao/Llama-2-7b-chat-hf-q4f16_1/None-cuda.so
- /root/.cache/mlc_chat/model_weights/junrushao/Llama-2-7b-chat-hf-q4f16_1/None-cuda.so
- /root/.cache/mlc_chat/model_weights/junrushao/None-cuda.so
If you would like to directly specify the model library path, you may consider passing in the `ChatModule.model_lib_path` parameter.
Please checkout https://github.com/mlc-ai/notebooks/blob/main/mlc-llm/tutorial_chat_module_getting_started.ipynb for an example on how to load a model.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/bin/mlc_chat", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/mlc_chat/__main__.py", line 36, in main
    cli.main(sys.argv[2:])
  File "/root/miniconda3/lib/python3.11/site-packages/mlc_chat/cli/chat.py", line 41, in main
    chat(
  File "/root/miniconda3/lib/python3.11/site-packages/mlc_chat/interface/chat.py", line 133, in chat
    cm = ChatModule(model, device, chat_config=config, model_lib_path=model_lib_path)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/mlc_chat/chat_module.py", line 766, in __init__
    jit.jit(
  File "/root/miniconda3/lib/python3.11/site-packages/mlc_chat/interface/jit.py", line 112, in jit
    _run_jit(
  File "/root/miniconda3/lib/python3.11/site-packages/mlc_chat/interface/jit.py", line 85, in _run_jit
    subprocess.run(cmd, check=True)
  File "/root/miniconda3/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/root/miniconda3/bin/python', '-m', 'mlc_chat', 'compile', '/root/.cache/mlc_chat/model_weights/junrushao/Llama-2-7b-chat-hf-q4f16_1', '--opt', 'flashinfer=1;cublas_gemm=1;cudagraph=0', '--overrides', 'tensor_parallel_shards=1', '--device', 'cuda:0', '--output', '/tmp/tmpqj3972xu/lib.so']' returned non-zero exit status 1.

@sherelynyap

The C++-based CLI is less maintained, and we could instead use Python APIs.

Could you instead do:

rm -rf prebuilt_libs/

And then use the Python script below:

from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging

logging.enable_logging()

MODEL = "/mnt/ssd/chenf/project/mlc_models/example/llama2/Llama-2-7b-chat-hf-q4f16_1-MLC/"

def main():
    cm = ChatModule(
        MODEL,
        chat_config=ChatConfig(context_window_size=1024),
    )
    cm.generate(
        "What is the meaning of life?",
        progress_callback=callback.StreamToStdout(callback_interval=2),
    )

if __name__ == "__main__":
    main()

OUTPUT:

[2024-01-19 17:25:06] ERROR model_metadata.py:51: FAILED: Encountered dynamic shape vocab_size, need to specify `--mlc-chat-config` for memory usage calculation.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/root/miniconda3/lib/python3.11/site-packages/mlc_chat/cli/model_metadata.py", line 177, in <module>
    main()
  File "/root/miniconda3/lib/python3.11/site-packages/mlc_chat/cli/model_metadata.py", line 171, in main
    _report_memory_usage(metadata, cfg)
  File "/root/miniconda3/lib/python3.11/site-packages/mlc_chat/cli/model_metadata.py", line 88, in _report_memory_usage
    params_bytes, temp_func_bytes, kv_cache_bytes = _compute_memory_usage(metadata, config)
                                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/mlc_chat/cli/model_metadata.py", line 77, in _compute_memory_usage    param_shape = _read_dynamic_shape(param["shape"], config)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/mlc_chat/cli/model_metadata.py", line 57, in _read_dynamic_shape
    raise AttributeError
AttributeError

It did manage to print out the reply:

Ah, a question that has puzzled philosophers, theologians, and scientists for centuries! The meaning of life is a complex and multifaceted topic, and there are many different perspectives and interpretations. However, I can offer some possible answers based on various philosophical, religious, and scientific viewpoints:
1. Religious perspective: Many religious traditions teach that the meaning of life is to fulfill a divine or spiritual purpose. For example, some believe that the purpose of life is to love and serve a higher power, while others believe that it is to follow the teachings of a particular religion or to achieve spiritual enlightenment.
2. Existentialist perspective: From an existentialist viewpoint, the meaning of life is something that each individual must create for themselves. According to this perspective, life has no inherent meaning, and it is up to each person to give their life meaning and purpose through their choices, actions, and experiences.
3. Humanistic perspective:
Floating point exception

@zhongzhenluo

I think I am still getting the same error when I run the script suggested above.
The error message is: "Cannot find PackedFunc paged_kv_cache.attention_kernel_prefill in either Relax VM kernel library, or in TVM runtime "

@MingkangW

Does the command below work:

mlc_chat chat HF://junrushao/Llama-2-7b-chat-hf-q4f16_1

With the command above, I got the error message below, but I can still chat with Llama. I'm not sure it's working correctly.

[2024-01-19 07:37:43] ERROR model_metadata.py:51: FAILED: Encountered dynamic shape vocab_size, need to specify `--mlc-chat-config` for memory usage calculation.
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/py_3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/mlc_chat/cli/model_metadata.py", line 152, in <module>
    main()
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/mlc_chat/cli/model_metadata.py", line 146, in main
    _report_memory_usage(metadata, cfg)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/mlc_chat/cli/model_metadata.py", line 77, in _report_memory_usage
    param_shape = _read_dynamic_shape(param["shape"], config)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/mlc_chat/cli/model_metadata.py", line 57, in _read_dynamic_shape
    raise AttributeError
AttributeError

With my Python code using ChatModule, I got this error:

[2024-01-19 07:41:07] ERROR model_metadata.py:134: FAILED to read metadata section in legacy model lib.
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/mlc_chat/cli/model_metadata.py", line 132, in main
    metadata = _extract_metadata(parsed.model_lib)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/mlc_chat/cli/model_metadata.py", line 24, in _extract_metadata
    return json.loads(VirtualMachine(load_module(model_lib), device("cpu"))["_metadata"]())
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/tvm/runtime/relax_vm.py", line 136, in __getitem__
    return self.module[key]
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/tvm/runtime/module.py", line 192, in __getitem__
    return self.get_function(name)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/tvm/runtime/module.py", line 176, in get_function
    raise AttributeError(f"Module has no function '{name}'")
AttributeError: Module has no function '_metadata'

I got the same bug. Do you know how to fix it?

@dusty-nv

dusty-nv commented Feb 7, 2024

@junrushao @tqchen I'm getting this same Cannot find PackedFunc paged_kv_cache.attention_kernel_prefill error with the llama-2-7b-chat model and CUDA backend (from MLC built yesterday), using the newer mlc_chat compile SLM method. The same model/install works when using the older mlc_llm.build method to compile the model .so.

from mlc_chat import ChatModule

cm = ChatModule(model='/data/models/mlc//slim/Llama-2-7b-chat-hf-q4f16_1', module_lib_path='/data/models/mlc/slim/Llama-2-7b-chat-hf-q4f16_1/Llama-2-7b-chat-hf-q4f16_1-cuda.so')

tvm.error.InternalError: Traceback (most recent call last):
  [bt] (6) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(TVMFuncCall+0x68) [0xffff7c6dac08]
  [bt] (5) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x34) [0xffff7c7ac004]
  [bt] (4) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::VirtualMachineImpl::_Init(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x1a8) [0xffff7c7a6698]
  [bt] (3) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::VirtualMachineImpl::InitFuncPool()+0x5b4) [0xffff7c7a5db4]
  [bt] (2) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(+0x30bf4e8) [0xffff7c79f4e8]
  [bt] (1) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::detail::LogFatal::Entry::Finalize()+0x68) [0xffff7a8cce78]
  [bt] (0) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::Backtrace[abi:cxx11]()+0x30) [0xffff7c722820]
  File "/opt/mlc-llm/3rdparty/tvm/src/runtime/relax_vm/vm.cc", line 676
InternalError: Check failed: (func.defined()) is false: Error: Cannot find PackedFunc paged_kv_cache.attention_kernel_prefill in either Relax VM kernel library, or in TVM runtime PackedFunc registry, or in global Relax functions of the VM executable

@CharlieFRuan
Contributor

CharlieFRuan commented Feb 7, 2024

Hi @dusty-nv, I believe updating both TVM and mlc_chat prebuilt packages and then recompiling .so again should solve the Cannot find PackedFunc issue as shown below. If you are building from source, updating to head and rebuilding it should work (to be safe, you can rm -rf build and follow these steps again). Let me know!

(mlc-prebuilts) cfruan@catalyst-fleet:/ssd1/cfruan/mlc-llm-repos/mlc-llm-head$ python
Python 3.11.7 | packaged by conda-forge | (main, Dec 23 2023, 14:43:09) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mlc_chat
>>> print(mlc_chat)
<module 'mlc_chat' from '/ssd1/cfruan/miniconda3/envs/mlc-prebuilts/lib/python3.11/site-packages/mlc_chat/__init__.py'>
>>> import tvm
>>> print(tvm)
<module 'tvm' from '/ssd1/cfruan/miniconda3/envs/mlc-prebuilts/lib/python3.11/site-packages/tvm/__init__.py'>
>>> cm = mlc_chat.ChatModule("dist/Llama-2-7b-chat-hf-q4f16_1-MLC/", model_lib_path="dist/llama-2-7b-q4f16_1-cuda.so")
[2024-02-07 17:45:02] INFO auto_device.py:76: Found device: cuda:0
[2024-02-07 17:45:02] INFO auto_device.py:76: Found device: cuda:1
[2024-02-07 17:45:03] INFO auto_device.py:85: Not found device: rocm:0
[2024-02-07 17:45:03] INFO auto_device.py:85: Not found device: metal:0
[2024-02-07 17:45:04] INFO auto_device.py:76: Found device: vulkan:0
[2024-02-07 17:45:04] INFO auto_device.py:76: Found device: vulkan:1
[2024-02-07 17:45:04] INFO auto_device.py:76: Found device: vulkan:2
[2024-02-07 17:45:04] INFO auto_device.py:85: Not found device: opencl:0
[2024-02-07 17:45:04] INFO auto_device.py:33: Using device: cuda:0
[2024-02-07 17:45:04] INFO chat_module.py:370: Using model folder: /ssd1/cfruan/mlc-llm-repos/mlc-llm-head/dist/Llama-2-7b-chat-hf-q4f16_1-MLC
[2024-02-07 17:45:04] INFO chat_module.py:371: Using mlc chat config: /ssd1/cfruan/mlc-llm-repos/mlc-llm-head/dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json
[2024-02-07 17:45:04] INFO chat_module.py:513: Using library model: dist/llama-2-7b-q4f16_1-cuda.so
[2024-02-07 17:45:04] INFO model_metadata.py:95: Total memory usage: 4077.14 MB (Parameters: 3615.13 MB. KVCache: 0.00 MB. Temporary buffer: 462.01 MB)
[2024-02-07 17:45:04] INFO model_metadata.py:104: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
>>> cm.generate("hi")
"Hello! *adjusts glasses* It's nice to meet you. How can I assist you today? Is there something you need help with or would you like to chat?"

@CharlieFRuan
Contributor

@MingkangW @sherelynyap The AttributeError should be fixed. mlc_chat chat HF://mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC on my end works with the updated mlc_chat_nightly

@dusty-nv

dusty-nv commented Feb 8, 2024

Hi @CharlieFRuan, I've rebuilt from scratch against commit 006e138 (and confirmed this change is present in the installed version of mlc_chat), and recompiled the model with the latest, and yet the same issue persists. Any ideas?

@CharlieFRuan
Contributor

@dusty-nv Are you using the prebuilt TVM relax or building TVM from source? In either case an update to TVM would be required as well! For prebuilt, simply updating the pip package should do it; for building from source, pull head, rm -rf build, and rebuild should work.
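
One quick way to sanity-check which builds you actually ended up with after updating (just a sketch; it mirrors the interactive session shown earlier in this thread) is to print the import paths:

import mlc_chat
import tvm

# The printed paths show whether Python is importing the freshly updated
# packages or a stale build elsewhere on the PYTHONPATH.
print(mlc_chat.__file__)
print(tvm.__file__)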

@dusty-nv

dusty-nv commented Feb 8, 2024

@CharlieFRuan I am building from source on ARM64+CUDA (using this Dockerfile, which re-clones the entire MLC/TVM source tree for each build, and hasn't given me prior issues with being out-of-date)

006e138 is the commit which should have fixed this bug right? I double-checked that code modification is present in the installed packages that I built.

Although, you do say that an updated TVM is required; however, the TVM Relax submodule reference in the MLC-LLM tree hasn't been updated in 5 days (mlc-ai/relax@2921370). I will try manually checking out the latest mlc-ai/relax@484d425 instead.

@dusty-nv

dusty-nv commented Feb 8, 2024

@CharlieFRuan rebuilt again with mlc-ai/relax@484d425 and still getting the same error...
EDIT: trying again with USE_FLASHINFER=ON (it would seem these symbols are related)

@CharlieFRuan
Contributor

CharlieFRuan commented Feb 8, 2024

Ahh I can reproduce this now. I've been using https://github.com/apache/tvm. Taking a look into https://github.com/mlc-ai/relax. Apologies for the inconvenience. EDIT: actually TVM main gives same error now, I must've been doing something different; looking into it

@CharlieFRuan
Contributor

CharlieFRuan commented Feb 8, 2024

I'm thinking about a better solution. Meanwhile, you can add these to the config.cmake for mlc-llm and tvm-unity and rebuild:

set(USE_FLASHINFER ON)
set(FLASHINFER_CUDA_ARCHITECTURES 89)
set(CMAKE_CUDA_ARCHITECTURES 89)

You can substitute 89 with the corresponding compute capability (it needs to be one of 80, 86, 89, or 90; if it is not, the Cannot find PackedFunc issue would probably not be encountered in the first place). Again, sorry for the troubles.
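
If you are unsure which compute capability to put there, a minimal sketch for querying it through the TVM runtime (assuming a CUDA device is visible) is:

import tvm

# Prints e.g. "8.9" on an RTX 4090 or "8.7" on Orin; drop the dot to get the
# number for FLASHINFER_CUDA_ARCHITECTURES / CMAKE_CUDA_ARCHITECTURES.
print(tvm.cuda(0).compute_version)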

@dusty-nv

dusty-nv commented Feb 8, 2024

Thanks @CharlieFRuan, enabling libflashinfer worked 👍

@sherelynyap

@MingkangW @sherelynyap The AttributeError should be fixed. mlc_chat chat HF://mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC on my end works with the updated mlc_chat_nightly

  1. I ran mlc_chat chat HF://mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC but got this error; I don't think I have permission(?) or the repo doesn't exist:
ValueError: Git clone failed with return code 128: None. The command was: ['git', 'clone', 'https://huggingface.co/junrushao/Llama-2-7b-chat-hf-q4f16_1.git', '.tmp']
  2. Running my own personal weights:
[2024-02-09 10:50:54] INFO auto_device.py:85: Not found device: cuda:0
[2024-02-09 10:50:54] INFO auto_device.py:85: Not found device: rocm:0
[2024-02-09 10:50:55] INFO auto_device.py:85: Not found device: metal:0
[2024-02-09 10:50:58] INFO auto_device.py:76: Found device: vulkan:0
[2024-02-09 10:50:58] INFO auto_device.py:76: Found device: vulkan:1
[2024-02-09 10:50:58] INFO auto_device.py:76: Found device: vulkan:2
[2024-02-09 10:50:58] INFO auto_device.py:76: Found device: vulkan:3
[2024-02-09 10:50:58] INFO auto_device.py:76: Found device: vulkan:4
[2024-02-09 10:50:58] INFO auto_device.py:85: Not found device: opencl:0
[2024-02-09 10:50:58] INFO auto_device.py:33: Using device: vulkan:0
[2024-02-09 10:50:58] INFO chat_module.py:370: Using model folder: C:\Users\yeoph\Documents\FYP_2023-2024\code\dist\sft_llama_finance-q4f16_1-MLC
[2024-02-09 10:50:58] INFO chat_module.py:371: Using mlc chat config: C:\Users\yeoph\Documents\FYP_2023-2024\code\dist\sft_llama_finance-q4f16_1-MLC\mlc-chat-config.json
[2024-02-09 10:50:58] INFO chat_module.py:762: Model lib not found. Now compiling model lib on device...
[2024-02-09 10:50:59] INFO jit.py:34: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-02-09 10:50:59] INFO llama_model.py:79: Overriding prefill_chunk_size from 4096 to 1024 (context_window_size)
[2024-02-09 10:50:59] INFO jit.py:93: Compiling using commands below:
[2024-02-09 10:50:59] INFO jit.py:94: 'C:\Users\yeoph\AppData\Local\Programs\Python\Python311\python.exe' -m mlc_chat compile 'dist\sft_llama_finance-q4f16_1-MLC' --opt 'flashinfer=1;cublas_gemm=1;cudagraph=0' --overrides 'context_window_size=1024;prefill_chunk_size=4096;max_batch_size=80;tensor_parallel_shards=1' --device vulkan:0 --output 'C:\Users\yeoph\AppData\Local\Temp\tmpaj6jj0mf\lib.dll'
[2024-02-09 10:51:00] INFO auto_config.py:69: Found model configuration: dist\sft_llama_finance-q4f16_1-MLC\mlc-chat-config.json
[2024-02-09 10:51:00] INFO auto_target.py:76: Detecting target device: vulkan:0
[2024-02-09 10:51:01] INFO auto_target.py:78: Found target: {"thread_warp_size": 1, "supports_float32": T.bool(True), "supports_int16": 1, "supports_int32": T.bool(True), "max_threads_per_block": 1024, "supports_int8": 1, "max_num_threads": 256, "kind": "vulkan", "max_shared_memory_per_block": 49152, "supports_16bit_buffer": 1, "tag": "", "keys": ["vulkan", "gpu"], "supports_float16": 1}
[2024-02-09 10:51:01] INFO auto_target.py:95: Found host LLVM triple: x86_64-pc-windows-msvc
[2024-02-09 10:51:01] INFO auto_target.py:96: Found host LLVM CPU: alderlake
[2024-02-09 10:51:01] INFO auto_config.py:153: Found model type: llama. Use `--model-type` to override.
Compiling with arguments:
  --config          LlamaConfig(hidden_size=4096, intermediate_size=11008, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=32000, position_embedding_base=10000.0, context_window_size=4096, prefill_chunk_size=4096, num_key_value_heads=32, head_dim=128, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
  --quantization    GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7)
  --model-type      llama
  --target          {"thread_warp_size": 1, "host": {"mtriple": "x86_64-pc-windows-msvc", "tag": "", "kind": "llvm", "mcpu": "alderlake", "keys": ["cpu"]}, "supports_int16": 1, "supports_float32": T.bool(True), "supports_int32": T.bool(True), "max_threads_per_block": 1024, "supports_int8": 1, "max_num_threads": 256, "kind": "vulkan", "max_shared_memory_per_block": 49152, "supports_16bit_buffer": 1, "tag": "", "keys": ["vulkan", "gpu"], "supports_float16": 1}
  --opt             flashinfer=0;cublas_gemm=0;cudagraph=0
  --system-lib-prefix ""
  --output          C:\Users\yeoph\AppData\Local\Temp\tmpaj6jj0mf\lib.dll
  --overrides       context_window_size=1024;sliding_window_size=None;prefill_chunk_size=4096;attention_sink_size=None;max_batch_size=80;tensor_parallel_shards=1
[2024-02-09 10:51:01] INFO config.py:106: Overriding context_window_size from 4096 to 1024
[2024-02-09 10:51:01] INFO config.py:106: Overriding prefill_chunk_size from 4096 to 4096
[2024-02-09 10:51:01] INFO config.py:106: Overriding max_batch_size from 80 to 80
[2024-02-09 10:51:01] INFO config.py:106: Overriding tensor_parallel_shards from 1 to 1
[2024-02-09 10:51:01] INFO llama_model.py:79: Overriding prefill_chunk_size from 4096 to 1024 (context_window_size)
[2024-02-09 10:51:01] INFO compile.py:135: Creating model from: LlamaConfig(hidden_size=4096, intermediate_size=11008, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=32000, position_embedding_base=10000.0, context_window_size=4096, prefill_chunk_size=4096, num_key_value_heads=32, head_dim=128, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
[2024-02-09 10:51:01] INFO compile.py:154: Exporting the model to TVM Unity compiler
[2024-02-09 10:51:04] INFO compile.py:160: Running optimizations using TVM Unity
[2024-02-09 10:51:04] INFO compile.py:173: Registering metadata: {'model_type': 'llama', 'quantization': 'q4f16_1', 'context_window_size': 1024, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 1024, 'tensor_parallel_shards': 1, 'kv_cache_bytes': 0}
[2024-02-09 10:51:04] INFO pipeline.py:42: Running TVM Relax graph-level optimizations
[2024-02-09 10:51:46] INFO pipeline.py:42: Lowering to TVM TIR kernels
[2024-02-09 10:51:49] INFO pipeline.py:42: Running TVM TIR-level optimizations
[2024-02-09 10:51:56] INFO pipeline.py:42: Running TVM Dlight low-level optimizations
[10:51:56] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:51:56] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:51:56] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:51:56] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:51:56] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:51:56] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:51:56] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:51:56] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:51:56] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:51:56] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:51:56] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:51:57] D:\a\package\package\tvm\src\tir\schedule\./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:51:57] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false:  Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:51:57] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false:  Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:51:57] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false:  Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:51:57] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false:  Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:51:57] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false:  Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:51:57] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false:  Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:51:57] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false:  Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[10:51:57] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false:  Loop variable's dtype (int32) is narrower than that of `min` or `extent` (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[2024-02-09 10:51:57] INFO pipeline.py:42: Lowering to VM bytecode
[2024-02-09 10:51:58] INFO estimate_memory_usage.py:55: [Memory usage] Function `batch_decode`: 9.02 MB
[2024-02-09 10:51:58] INFO estimate_memory_usage.py:55: [Memory usage] Function `batch_prefill`: 115.50 MB
[2024-02-09 10:51:58] INFO estimate_memory_usage.py:55: [Memory usage] Function `batch_verify`: 115.50 MB
[2024-02-09 10:51:58] INFO estimate_memory_usage.py:55: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB
[2024-02-09 10:51:58] INFO estimate_memory_usage.py:55: [Memory usage] Function `decode`: 0.09 MB
[2024-02-09 10:51:58] INFO estimate_memory_usage.py:55: [Memory usage] Function `embed`: 16.00 MB
[2024-02-09 10:51:58] INFO estimate_memory_usage.py:55: [Memory usage] Function `prefill`: 96.51 MB
[2024-02-09 10:51:58] INFO estimate_memory_usage.py:55: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-02-09 10:51:59] INFO pipeline.py:42: Compiling external modules
[2024-02-09 10:51:59] INFO pipeline.py:42: Compilation complete! Exporting to disk
[10:52:01] D:\a\package\package\tvm\src\target\spirv\codegen_spirv.h:163: InternalError: Check failed: type == expected_type (float32x2 vs. float16x2) : Attempted to access buffer K_smem as element type float32x2 using an index of size 2 when the element type is float16
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\yeoph\AppData\Local\Programs\Python\Python311\Lib\site-packages\mlc_chat\__main__.py", line 47, in <module>
    main()
  File "C:\Users\yeoph\AppData\Local\Programs\Python\Python311\Lib\site-packages\mlc_chat\__main__.py", line 24, in main    cli.main(sys.argv[2:])
  File "C:\Users\yeoph\AppData\Local\Programs\Python\Python311\Lib\site-packages\mlc_chat\cli\compile.py", line 131, in main
    compile(
  File "C:\Users\yeoph\AppData\Local\Programs\Python\Python311\Lib\site-packages\mlc_chat\interface\compile.py", line 226, in compile
    _compile(args, model_config)
  File "C:\Users\yeoph\AppData\Local\Programs\Python\Python311\Lib\site-packages\mlc_chat\interface\compile.py", line 175, in _compile
    args.build_func(
  File "C:\Users\yeoph\AppData\Local\Programs\Python\Python311\Lib\site-packages\mlc_chat\support\auto_target.py", line 235, in build
    relax.build(
  File "C:\Users\yeoph\AppData\Local\Programs\Python\Python311\Lib\site-packages\tvm\relax\vm_build.py", line 341, in build
    return _vmlink(
           ^^^^^^^^
  File "C:\Users\yeoph\AppData\Local\Programs\Python\Python311\Lib\site-packages\tvm\relax\vm_build.py", line 247, in _vmlink
    lib = tvm.build(
          ^^^^^^^^^^
  File "C:\Users\yeoph\AppData\Local\Programs\Python\Python311\Lib\site-packages\tvm\driver\build_module.py", line 294, in build
    rt_mod_host = _driver_ffi.tir_to_runtime(annotated_mods, target_host)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\yeoph\AppData\Local\Programs\Python\Python311\Lib\site-packages\tvm\_ffi\_ctypes\packed_func.py", line 239, in __call__
    raise_last_ffi_error()
  File "C:\Users\yeoph\AppData\Local\Programs\Python\Python311\Lib\site-packages\tvm\_ffi\base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  File "D:\a\package\package\tvm\src\target\spirv\codegen_spirv.h", line 163
InternalError: Check failed: type == expected_type (float32x2 vs. float16x2) : Attempted to access buffer K_smem as element type float32x2 using an index of size 2 when the element type is float16
Traceback (most recent call last):
  File "C:\Users\yeoph\AppData\Local\Programs\Python\Python311\Lib\site-packages\mlc_chat\chat_module.py", line 753, in __init__
    self.model_lib_path = _get_lib_module_path(
                          ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\yeoph\AppData\Local\Programs\Python\Python311\Lib\site-packages\mlc_chat\chat_module.py", line 575, in _get_lib_module_path
    raise FileNotFoundError(err_msg)
FileNotFoundError: Cannot find the model library that corresponds to `None`.
`None` is either provided in the `chat_config` you passed in, or specified in dist/sft_llama_finance-q4f16_1-MLC/mlc-chat-config.json.
We searched over the following possible paths:
- None-vulkan.dll
- dist/prebuilt/lib/None-vulkan.dll
- dist/dist/sft_llama_finance-q4f16_1-MLC//None-vulkan.dll
- dist/sft_llama_finance-q4f16_1-MLC/None-vulkan.dll
- C:\Users\yeoph\Documents\FYP_2023-2024\code\dist\None-vulkan.dll
If you would like to directly specify the model library path, you may consider passing in the `ChatModule.model_lib_path` parameter.
Please checkout https://github.com/mlc-ai/notebooks/blob/main/mlc-llm/tutorial_chat_module_getting_started.ipynb for an example on how to load a model.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\yeoph\Documents\FYP_2023-2024\code\main.py", line 19, in <module>
    main()
  File "C:\Users\yeoph\Documents\FYP_2023-2024\code\main.py", line 9, in main
    cm = ChatModule(
         ^^^^^^^^^^^
  File "C:\Users\yeoph\AppData\Local\Programs\Python\Python311\Lib\site-packages\mlc_chat\chat_module.py", line 768, in __init__
    jit.jit(
  File "C:\Users\yeoph\AppData\Local\Programs\Python\Python311\Lib\site-packages\mlc_chat\interface\jit.py", line 122, in jit
    _run_jit(
  File "C:\Users\yeoph\AppData\Local\Programs\Python\Python311\Lib\site-packages\mlc_chat\interface\jit.py", line 95, in _run_jit
    subprocess.run(cmd, check=True)
  File "C:\Users\yeoph\AppData\Local\Programs\Python\Python311\Lib\subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['C:\\Users\\yeoph\\AppData\\Local\\Programs\\Python\\Python311\\python.exe', '-m', 'mlc_chat', 'compile', 'dist\\sft_llama_finance-q4f16_1-MLC', '--opt', 'flashinfer=1;cublas_gemm=1;cudagraph=0', '--overrides', 'context_window_size=1024;prefill_chunk_size=4096;max_batch_size=80;tensor_parallel_shards=1', '--device', 'vulkan:0', '--output', 'C:\\Users\\yeoph\\AppData\\Local\\Temp\\tmpaj6jj0mf\\lib.dll']' returned non-zero exit status 1.

@CharlieFRuan
Contributor

@sherelynyap
For 1: I think the path was wrong, apologies for that; it should be python -m mlc_chat chat HF://junrushao/Llama-2-7b-chat-hf-q4f16_1-MLC instead (missed the -MLC before).

For 2: the issue here is Vulkan specific and illustrated in this line of your log

[10:52:01] D:\a\package\package\tvm\src\target\spirv\codegen_spirv.h:163: InternalError: Check failed: type == expected_type (float32x2 vs. float16x2) : Attempted to access buffer K_smem as element type float32x2 using an index of size 2 when the element type is float16

This is fixed in #1725; and I checked that this change is already in the newest mlc_chat_nightly.

@jpf888

jpf888 commented Feb 19, 2024

Thanks @CharlieFRuan, enabling libflashinfer worked 👍

Hi @dusty-nv @CharlieFRuan, I am hitting the same error as you:
InternalError: Check failed: (func.defined()) is false: Error: Cannot find PackedFunc paged_kv_cache.attention_kernel_prefill in either Relax VM kernel library, or in TVM runtime PackedFunc registry, or in global Relax functions of the VM executable

Because my environment is CUDA 11.4 + ARM on Orin, I set the following options:

set(USE_FLASHINFER ON)
set(FLASHINFER_CUDA_ARCHITECTURES 87)
set(CMAKE_CUDA_ARCHITECTURES 87)

but got an error:
tvm-unity/3rdparty/flashinfer/include/flashinfer/prefill.cuh(708): error: class "cooperative_groups::__v1::thread_block" has no member "num_threads

Is there a way to solve this problem without enabling FlashInfer?
ref #1555

@MasterJH5574
Member

Hey @jpf888 thanks for reporting. May I ask which GPU you are using?

@jpf888

jpf888 commented Feb 20, 2024

@MasterJH5574 @CharlieFRuan

Another question: when I compile with q4f16 selected, the following error is reported:

[2024-02-20 16:49:51] INFO compile.py:136: Creating model from: LlavaConfig(image_token_index=32000, text_config=LlamaConfig(hidden_size=3200, intermediate_size=8640, num_attention_heads=32, num_hidden_layers=26, rms_norm_eps=1e-06, vocab_size=32000, position_embedding_base=10000, context_window_size=2048, prefill_chunk_size=2048, num_key_value_heads=32, head_dim=100, tensor_parallel_shards=1, max_batch_size=80, kwargs={'architectures': ['LlavaLlamaForCausalLM'], 'model_type': 'llama', 'torch_dtype': 'bfloat16'}), vision_config=LlavaVisionConfig(hidden_size=1024, image_size=336, intermediate_size=4096, num_attention_heads=8, num_hidden_layers=24, patch_size=14, projection_dim=3200, vocab_size=32000, dtype='float16', num_channels=3, layer_norm_eps=1e-06, kwargs={'model_type': 'clip_vision_model'}), vocab_size=32000, context_window_size=2048, prefill_chunk_size=2048, tensor_parallel_shards=1, dtype='float16', kwargs={})
[2024-02-20 16:49:52] INFO compile.py:155: Exporting the model to TVM Unity compiler
Traceback (most recent call last):
  File "/opt/conda/envs/mlc-llm-tvm-apply/bin/mlc_chat", line 33, in <module>
    sys.exit(load_entry_point('mlc-chat', 'console_scripts', 'mlc_chat')())
  File "/workspace/mount_dir/mlc-llm-llava/mlc-llm/python/mlc_chat/__main__.py", line 24, in main
    cli.main(sys.argv[2:])
  File "/workspace/mount_dir/mlc-llm-llava/mlc-llm/python/mlc_chat/cli/compile.py", line 131, in main
    compile(
  File "/workspace/mount_dir/mlc-llm-llava/mlc-llm/python/mlc_chat/interface/compile.py", line 229, in compile
    _compile(args, model_config)
  File "/workspace/mount_dir/mlc-llm-llava/mlc-llm/python/mlc_chat/interface/compile.py", line 156, in _compile
    mod, named_params, ext_mods = model.export_tvm(
  File "/workspace/nfs/tvm-unity/python/tvm/relax/frontend/nn/core.py", line 479, in export_tvm
    mod, params, ext_mods = Exporter(debug=debug).build(spec)
  File "/workspace/nfs/tvm-unity/python/tvm/relax/frontend/nn/exporter.py", line 136, in build
    outputs, inputs = _emit_method(self.builder, method_spec, params, effects)
  File "/workspace/nfs/tvm-unity/python/tvm/relax/frontend/nn/exporter.py", line 277, in _emit_method
    outputs = spec.method(*explicit_inputs)
  File "/workspace/mount_dir/mlc-llm-llava/mlc-llm/python/mlc_chat/model/llava/llava_model.py", line 503, in create_tir_paged_kv_cache
    return TIRPagedKVCache(
  File "/workspace/mount_dir/mlc-llm-llava/mlc-llm/python/mlc_chat/nn/kv_cache.py", line 260, in __init__
    bb.add_func(_attention_prefill(num_key_value_heads, num_attention_heads, head_dim, dtype), "tir_attention_prefill"),
  File "/workspace/mount_dir/mlc-llm-llava/mlc-llm/python/mlc_chat/nn/kv_cache.py", line 712, in _attention_prefill
    apply_to_gemm(sch, sch.get_block("O_gemm"), tile_o, 2, 3, k_major=False)
  File "/workspace/mount_dir/mlc-llm-llava/mlc-llm/python/mlc_chat/nn/kv_cache.py", line 692, in apply_to_gemm
    ty, tx = sch.split(t, factors=[num_warps, 32])
  File "/workspace/nfs/tvm-unity/python/tvm/tir/schedule/_type_checker.py", line 340, in wrap
    return func(*args, **kwargs)
  File "/workspace/nfs/tvm-unity/python/tvm/tir/schedule/schedule.py", line 811, in split
    _ffi_api.ScheduleSplit(  # type: ignore # pylint: disable=no-member
  File "/workspace/nfs/tvm-unity/python/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
    raise_last_ffi_error()
  File "/workspace/nfs/tvm-unity/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm.tir.schedule.schedule.ScheduleError: Traceback (most recent call last):
  1: tvm::tir::TracedScheduleNode::Split(tvm::tir::LoopRV const&, tvm::runtime::Array<tvm::runtime::Optional<tvm::PrimExpr>, void> const&, bool)
        at /workspace/nfs/tvm-unity/src/tir/schedule/traced_schedule.cc:230
  0: tvm::tir::ConcreteScheduleNode::Split(tvm::tir::LoopRV const&, tvm::runtime::Array<tvm::runtime::Optional<tvm::PrimExpr>, void> const&, bool)
        at /workspace/nfs/tvm-unity/src/tir/schedule/concrete_schedule.cc:506
ScheduleError: An error occurred in the schedule primitive 'split'.

My model has hidden_size=3200 and num_attention_heads=32, so the head size is 100, and this error is reported. When I use q4f32, it compiles successfully.

The reason may be that with q4f16, create_tir_paged_kv_cache is performed, and a head size of 100 is not supported by TIRPagedKVCache's _attention_decode?

I hit the same kind of error in TensorRT-LLM: a kernel head size of 100 is not supported, but 112 is OK.
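
Purely as an illustration of that head-size hypothesis (the actual tile sizes used by kv_cache.py may differ), 100 is not a multiple of the usual power-of-two tile widths, while 112 and 128 are at least multiples of 16:

# Illustrative only: check which common tile widths divide each candidate head size.
for head_dim in (100, 112, 128):
    divisors = [t for t in (8, 16, 32, 64) if head_dim % t == 0]
    print(head_dim, "divisible by:", divisors or "none of 8/16/32/64")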

@jpf888

jpf888 commented Feb 22, 2024

Hi @MasterJH5574, can you give me some ideas? Thanks!

@dusty-nv

Hi @jpf888, I have not been able to compile MLC/TVM with CUDA 11.4 / JetPack 5 since around 51fb0f4; for newer builds I am using CUDA 12.2 / JetPack 6.

@jpf888

jpf888 commented Feb 22, 2024

@dusty-nv Thank you very much, I got it.

@sjtu-scx

Hello! When I run mlc_chat compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json --device vulkan -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-vulkan.dll, I get the same error as in case 2 above (the codegen_spirv.h InternalError). Can you help me?

Errors:
[2024-02-23 18:58:43] INFO auto_config.py:69: Found model configuration: dist\Llama-2-7b-chat-hf-q4f16_1-MLC\mlc-chat-config.json
[2024-02-23 18:58:46] INFO auto_device.py:76: Found device: vulkan:0
[2024-02-23 18:58:46] INFO auto_target.py:63: Found configuration of target device "vulkan:0": {"thread_warp_size": 1, "supports_float32": T.bool(True), "supports_int16": 1, "supports_int32": T.bool(True), "max_threads_per_block": 1024, "supports_int8": 1, "max_num_threads": 256, "kind": "vulkan", "max_shared_memory_per_block": 49152, "supports_16bit_buffer": 1, "tag": "", "keys": ["vulkan", "gpu"], "supports_float16": 1}
[2024-02-23 18:58:46] INFO auto_target.py:95: Found host LLVM triple: x86_64-pc-windows-msvc
[2024-02-23 18:58:46] INFO auto_target.py:96: Found host LLVM CPU: alderlake
[2024-02-23 18:58:46] INFO auto_config.py:153: Found model type: llama. Use --model-type to override.
Compiling with arguments:
--config LlamaConfig(hidden_size=4096, intermediate_size=11008, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=32000, position_embedding_base=10000, context_window_size=768, prefill_chunk_size=768, num_key_value_heads=32, head_dim=128, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
--quantization GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7)
--model-type llama
--target {"thread_warp_size": 1, "host": {"mtriple": "x86_64-pc-windows-msvc", "tag": "", "kind": "llvm", "mcpu": "alderlake", "keys": ["cpu"]}, "supports_int16": 1, "supports_float32": T.bool(True), "supports_int32": T.bool(True), "max_threads_per_block": 1024, "supports_int8": 1, "max_num_threads": 256, "kind": "vulkan", "max_shared_memory_per_block": 49152, "supports_16bit_buffer": 1, "tag": "", "keys": ["vulkan", "gpu"], "supports_float16": 1}
--opt flashinfer=0;cublas_gemm=0;cudagraph=0
--system-lib-prefix ""
--output dist\libs\Llama-2-7b-chat-hf-q4f16_1-vulkan.dll
--overrides context_window_size=None;sliding_window_size=None;prefill_chunk_size=None;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=None
[2024-02-23 18:58:46] INFO compile.py:135: Creating model from: LlamaConfig(hidden_size=4096, intermediate_size=11008, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=32000, position_embedding_base=10000, context_window_size=768, prefill_chunk_size=768, num_key_value_heads=32, head_dim=128, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
[2024-02-23 18:58:46] INFO compile.py:154: Exporting the model to TVM Unity compiler
[2024-02-23 18:58:51] INFO compile.py:160: Running optimizations using TVM Unity
[2024-02-23 18:58:51] INFO compile.py:173: Registering metadata: {'model_type': 'llama', 'quantization': 'q4f16_1', 'context_window_size': 768, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 768, 'tensor_parallel_shards': 1, 'kv_cache_bytes': 0}
[2024-02-23 18:58:51] INFO pipeline.py:42: Running TVM Relax graph-level optimizations
[2024-02-23 18:59:44] INFO pipeline.py:42: Lowering to TVM TIR kernels
[2024-02-23 18:59:48] INFO pipeline.py:42: Running TVM TIR-level optimizations
[2024-02-23 18:59:57] INFO pipeline.py:42: Running TVM Dlight low-level optimizations
[18:59:57] D:\a\package\package\tvm\src\tir\schedule./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[18:59:57] D:\a\package\package\tvm\src\tir\schedule./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[18:59:57] D:\a\package\package\tvm\src\tir\schedule./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[18:59:57] D:\a\package\package\tvm\src\tir\schedule./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[18:59:58] D:\a\package\package\tvm\src\tir\schedule./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[18:59:58] D:\a\package\package\tvm\src\tir\schedule./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[18:59:58] D:\a\package\package\tvm\src\tir\schedule./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[18:59:58] D:\a\package\package\tvm\src\tir\schedule./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[18:59:58] D:\a\package\package\tvm\src\tir\schedule./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[18:59:58] D:\a\package\package\tvm\src\tir\schedule./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[18:59:58] D:\a\package\package\tvm\src\tir\schedule./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[18:59:58] D:\a\package\package\tvm\src\tir\schedule./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[18:59:58] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false: Loop variable's dtype (int32) is narrower than that of min or extent (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[18:59:58] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false: Loop variable's dtype (int32) is narrower than that of min or extent (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[18:59:58] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false: Loop variable's dtype (int32) is narrower than that of min or extent (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[18:59:58] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false: Loop variable's dtype (int32) is narrower than that of min or extent (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[18:59:58] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false: Loop variable's dtype (int32) is narrower than that of min or extent (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[18:59:58] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false: Loop variable's dtype (int32) is narrower than that of min or extent (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[18:59:58] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false: Loop variable's dtype (int32) is narrower than that of min or extent (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[18:59:58] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false: Loop variable's dtype (int32) is narrower than that of min or extent (int64)
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

[2024-02-23 18:59:58] INFO pipeline.py:42: Lowering to VM bytecode
[2024-02-23 19:00:00] INFO estimate_memory_usage.py:55: [Memory usage] Function batch_decode: 9.02 MB
[2024-02-23 19:00:00] INFO estimate_memory_usage.py:55: [Memory usage] Function batch_prefill: 86.62 MB
[2024-02-23 19:00:00] INFO estimate_memory_usage.py:55: [Memory usage] Function batch_verify: 86.62 MB
[2024-02-23 19:00:00] INFO estimate_memory_usage.py:55: [Memory usage] Function create_tir_paged_kv_cache: 0.00 MB
[2024-02-23 19:00:00] INFO estimate_memory_usage.py:55: [Memory usage] Function decode: 0.09 MB
[2024-02-23 19:00:00] INFO estimate_memory_usage.py:55: [Memory usage] Function embed: 12.00 MB
[2024-02-23 19:00:00] INFO estimate_memory_usage.py:55: [Memory usage] Function prefill: 72.38 MB
[2024-02-23 19:00:00] INFO estimate_memory_usage.py:55: [Memory usage] Function softmax_with_temperature: 0.00 MB
[2024-02-23 19:00:01] INFO pipeline.py:42: Compiling external modules
[2024-02-23 19:00:01] INFO pipeline.py:42: Compilation complete! Exporting to disk
[19:00:04] D:\a\package\package\tvm\src\target\spirv\codegen_spirv.h:163: InternalError: Check failed: type == expected_type (float32x2 vs. float16x2) : Attempted to access buffer K_smem as element type float32x2 using an index of size 2 when the element type is float16
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

@MasterJH5574
Member

@sjtu-scx This issue is supposed to have been addressed by apache/tvm#16554 and #1725. Could you update to the latest TVM and MLC and try it again?

@MasterJH5574
Member

Gonna close this due to inactivity. Please open a new issue if there are any other problems.
