[Bug] [llama2-7B] fail to execute Llama-2-7b-chat-hf-q4f16_1-MLC #1551
Comments
The C++-based CLI is less maintained, and we could instead use Python APIs. Could you instead do:

rm -rf prebuilt_libs/

and then use the Python script below:

from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging

logging.enable_logging()

MODEL = "/mnt/ssd/chenf/project/mlc_models/example/llama2/Llama-2-7b-chat-hf-q4f16_1-MLC/"


def main():
    cm = ChatModule(
        MODEL,
        chat_config=ChatConfig(context_window_size=1024),
    )
    cm.generate(
        "What is the meaning of life?",
        progress_callback=callback.StreamToStdout(callback_interval=2),
    )


if __name__ == "__main__":
    main()
@junrushao Thanks for your reply! I switched to running the Python script above, but got new errors. I guess I didn't compile a dependency when compiling TVM, right?

[2024-01-07 17:08:31] INFO auto_device.py:76: Found device: cuda:0
[2024-01-07 17:08:32] INFO auto_device.py:85: Not found device: rocm:0
[2024-01-07 17:08:32] INFO auto_device.py:85: Not found device: metal:0
[2024-01-07 17:08:32] INFO auto_device.py:85: Not found device: vulkan:0
[2024-01-07 17:08:33] INFO auto_device.py:85: Not found device: opencl:0
[2024-01-07 17:08:33] INFO auto_device.py:33: Using device: cuda:0
[2024-01-07 17:08:33] INFO chat_module.py:366: Using model folder: /mnt/ssd/chenf/project/mlc_models/example/llama2/Llama-2-7b-chat-hf-q4f16_1-MLC
[2024-01-07 17:08:33] INFO chat_module.py:367: Using mlc chat config: /mnt/ssd/chenf/project/mlc_models/example/llama2/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json
[2024-01-07 17:08:33] INFO chat_module.py:756: Model lib not found. Now compiling model lib on device...
[2024-01-07 17:08:33] INFO llama_model.py:79: Overriding prefill_chunk_size from 4096 to 1024 (context_window_size)
[2024-01-07 17:08:33] INFO jit.py:83: Compiling using commands below:
[2024-01-07 17:08:33] INFO jit.py:84: /mnt/ssd/chenf/software/miniconda3/envs/relax-py10/bin/python -m mlc_chat compile /mnt/ssd/chenf/project/mlc_models/example/llama2/Llama-2-7b-chat-hf-q4f16_1-MLC --opt 'flashinfer=1;cublas_gemm=1;cudagraph=0' --overrides 'context_window_size=1024;prefill_chunk_size=4096;tensor_parallel_shards=1' --device cuda:0 --output /tmp/tmpqjedea0v/lib.so
[2024-01-07 17:08:34] INFO auto_config.py:69: Found model configuration: /mnt/ssd/chenf/project/mlc_models/example/llama2/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json
[2024-01-07 17:08:34] INFO auto_target.py:75: Detecting target device: cuda:0
[2024-01-07 17:08:34] INFO auto_target.py:77: Found target: {"thread_warp_size": 32, "arch": "sm_70", "max_threads_per_block": 1024, "max_num_threads": 1024, "kind": "cuda", "max_shared_memory_per_block": 49152, "tag": "", "keys": ["cuda", "gpu"]}
[2024-01-07 17:08:34] INFO auto_target.py:94: Found host LLVM triple: x86_64-conda-linux-gnu
[2024-01-07 17:08:34] INFO auto_target.py:95: Found host LLVM CPU: skylake-avx512
[2024-01-07 17:08:34] INFO auto_target.py:242: Generating code for CUDA architecture: sm_70
[2024-01-07 17:08:34] INFO auto_target.py:243: To produce multi-arch fatbin, set environment variable MLC_MULTI_ARCH. Example: MLC_MULTI_ARCH=70,72,75,80,86,87,89,90
[2024-01-07 17:08:34] INFO auto_config.py:151: Found model type: llama. Use `--model-type` to override.
[2024-01-07 17:08:34] WARNING compiler_flags.py:67: flashinfer is not supported on CUDA arch < 80
Compiling with arguments:
--config LlamaConfig(hidden_size=4096, intermediate_size=11008, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-06, vocab_size=32000, position_embedding_base=10000, context_window_size=4096, prefill_chunk_size=4096, num_key_value_heads=32, head_dim=128, tensor_parallel_shards=1, max_batch_size=1, kwargs={})
--quantization GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7)
--model-type llama
--target {"thread_warp_size": 32, "host": {"mtriple": "x86_64-conda-linux-gnu", "tag": "", "kind": "llvm", "mcpu": "skylake-avx512", "keys": ["cpu"]}, "arch": "sm_70", "max_threads_per_block": 1024, "max_num_threads": 1024, "kind": "cuda", "max_shared_memory_per_block": 49152, "tag": "", "keys": ["cuda", "gpu"]}
--opt flashinfer=0;cublas_gemm=0;cudagraph=0
--system-lib-prefix ""
--output /tmp/tmpqjedea0v/lib.so
--overrides context_window_size=1024;sliding_window_size=None;prefill_chunk_size=4096;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=1
[2024-01-07 17:08:34] INFO compiler_flags.py:118: Overriding context_window_size from 4096 to 1024
[2024-01-07 17:08:34] INFO compiler_flags.py:118: Overriding prefill_chunk_size from 4096 to 4096
[2024-01-07 17:08:34] INFO compiler_flags.py:118: Overriding tensor_parallel_shards from 1 to 1
[2024-01-07 17:08:34] INFO llama_model.py:79: Overriding prefill_chunk_size from 4096 to 1024 (context_window_size)
[2024-01-07 17:08:34] INFO compile.py:131: Creating model from: LlamaConfig(hidden_size=4096, intermediate_size=11008, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-06, vocab_size=32000, position_embedding_base=10000, context_window_size=4096, prefill_chunk_size=4096, num_key_value_heads=32, head_dim=128, tensor_parallel_shards=1, max_batch_size=1, kwargs={})
[2024-01-07 17:08:34] INFO compile.py:141: Exporting the model to TVM Unity compiler
[2024-01-07 17:08:39] INFO compile.py:147: Running optimizations using TVM Unity
[2024-01-07 17:08:39] INFO compile.py:160: Registering metadata: {'model_type': 'llama', 'quantization': 'q4f16_1', 'context_window_size': 1024, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 1024, 'tensor_parallel_shards': 1, 'kv_cache_bytes': 536870912}
[2024-01-07 17:08:39] INFO pipeline.py:35: Running TVM Relax graph-level optimizations
[2024-01-07 17:08:42] INFO pipeline.py:35: Lowering to TVM TIR kernels
[2024-01-07 17:08:48] INFO pipeline.py:35: Running TVM TIR-level optimizations
[2024-01-07 17:08:57] INFO pipeline.py:35: Running TVM Dlight low-level optimizations
[2024-01-07 17:09:09] INFO pipeline.py:35: Lowering to VM bytecode
[2024-01-07 17:09:10] INFO estimate_memory_usage.py:55: [Memory usage] Function `_initialize_effect`: 0.00 MB
[2024-01-07 17:09:10] INFO estimate_memory_usage.py:55: [Memory usage] Function `batch_decode`: 0.09 MB
[2024-01-07 17:09:10] INFO estimate_memory_usage.py:55: [Memory usage] Function `batch_prefill`: 88.51 MB
[2024-01-07 17:09:10] INFO estimate_memory_usage.py:55: [Memory usage] Function `create_flashinfer_paged_kv_cache`: 0.00 MB
[2024-01-07 17:09:10] INFO estimate_memory_usage.py:55: [Memory usage] Function `decode`: 8.36 MB
[2024-01-07 17:09:10] INFO estimate_memory_usage.py:55: [Memory usage] Function `embed`: 8.00 MB
[2024-01-07 17:09:10] INFO estimate_memory_usage.py:55: [Memory usage] Function `prefill`: 245.63 MB
[2024-01-07 17:09:10] INFO estimate_memory_usage.py:55: [Memory usage] Function `softmax_with_temperature`: 0.12 MB
[2024-01-07 17:09:11] INFO pipeline.py:35: Compiling external modules
[2024-01-07 17:09:11] INFO pipeline.py:35: Compilation complete! Exporting to disk
[2024-01-07 17:09:22] INFO compile.py:175: Generated: /tmp/tmpqjedea0v/lib.so
[2024-01-07 17:09:22] INFO jit.py:87: Using compiled model lib: /iothome/chenf/.cache/mlc_chat/model_lib/beca1ee3070ef1b4cd8bbddad9a1c09d.so
[2024-01-07 17:09:23] ERROR model_metadata.py:93: FAILED to read metadata section in legacy model lib.
Traceback (most recent call last):
File "/mnt/ssd/chenf/opensource/mlc-llm/python/mlc_chat/cli/model_metadata.py", line 91, in main
metadata = _extract_metadata(parsed.model_lib)
File "/mnt/ssd/chenf/opensource/mlc-llm/python/mlc_chat/cli/model_metadata.py", line 24, in _extract_metadata
return json.loads(VirtualMachine(load_module(model_lib), device("cpu"))["_metadata"]())
File "/mnt/ssd/chenf/opensource/tvm/python/tvm/runtime/relax_vm.py", line 97, in __init__
self._setup_device(device, memory_cfg)
File "/mnt/ssd/chenf/opensource/tvm/python/tvm/runtime/relax_vm.py", line 133, in _setup_device
self.module["vm_initialization"](*init_args)
File "/mnt/ssd/chenf/opensource/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
raise_last_ffi_error()
File "/mnt/ssd/chenf/opensource/tvm/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
File "/mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc", line 839, in tvm::runtime::relax_vm::VirtualMachineImpl::_Init(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
this->Init(devices, alloc_types);
File "/mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc", line 462, in tvm::runtime::relax_vm::VirtualMachineImpl::Init(std::vector<DLDevice, std::allocator<DLDevice> > const&, std::vector<tvm::runtime::memory::AllocatorType, std::allocator<tvm::runtime::memory::AllocatorType> > const&)
this->InitFuncPool();
File "/mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc", line 676, in tvm::runtime::relax_vm::VirtualMachineImpl::InitFuncPool()
ICHECK(func.defined())
tvm.error.InternalError: Traceback (most recent call last):
2: tvm::runtime::relax_vm::VirtualMachineImpl::_Init(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
at /mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc:839
1: tvm::runtime::relax_vm::VirtualMachineImpl::Init(std::vector<DLDevice, std::allocator<DLDevice> > const&, std::vector<tvm::runtime::memory::AllocatorType, std::allocator<tvm::runtime::memory::AllocatorType> > const&)
at /mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc:462
0: tvm::runtime::relax_vm::VirtualMachineImpl::InitFuncPool()
at /mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc:676
File "/mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc", line 676
InternalError: Check failed: (func.defined()) is false: Error: Cannot find PackedFunc paged_kv_cache.attention_kernel_prefill in either Relax VM kernel library, or in TVM runtime PackedFunc registry, or in global Relax functions of the VM executable
Traceback (most recent call last):
File "/mnt/ssd/chenf/project/mlc_models/example/llama2/run.py", line 19, in <module>
main()
File "/mnt/ssd/chenf/project/mlc_models/example/llama2/run.py", line 9, in main
cm = ChatModule(
File "/mnt/ssd/chenf/opensource/mlc-llm/python/mlc_chat/chat_module.py", line 774, in __init__
self._reload(self.model_lib_path, self.model_path, user_chat_config_json_str)
File "/mnt/ssd/chenf/opensource/mlc-llm/python/mlc_chat/chat_module.py", line 988, in _reload
self._reload_func(lib, model_path, app_config_json)
File "/mnt/ssd/chenf/opensource/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
raise_last_ffi_error()
File "/mnt/ssd/chenf/opensource/tvm/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
File "/mnt/ssd/chenf/opensource/mlc-llm/cpp/llm_chat.cc", line 1541, in mlc::llm::LLMChatModule::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
chat_->Reload(args[0], args[1], args[2]);
File "/mnt/ssd/chenf/opensource/mlc-llm/cpp/llm_chat.cc", line 557, in mlc::llm::LLMChat::Reload(tvm::runtime::TVMArgValue, tvm::runtime::String, tvm::runtime::String)
this->ft_.Init(reload_lib, device_, this->num_shards_);
File "/mnt/ssd/chenf/opensource/mlc-llm/cpp/llm_chat.cc", line 160, in Init
this->local_vm->GetFunction("vm_initialization")(
File "/mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc", line 839, in tvm::runtime::relax_vm::VirtualMachineImpl::_Init(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
this->Init(devices, alloc_types);
File "/mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc", line 462, in tvm::runtime::relax_vm::VirtualMachineImpl::Init(std::vector<DLDevice, std::allocator<DLDevice> > const&, std::vector<tvm::runtime::memory::AllocatorType, std::allocator<tvm::runtime::memory::AllocatorType> > const&)
this->InitFuncPool();
File "/mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc", line 676, in tvm::runtime::relax_vm::VirtualMachineImpl::InitFuncPool()
ICHECK(func.defined())
tvm.error.InternalError: Traceback (most recent call last):
5: mlc::llm::LLMChatModule::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
at /mnt/ssd/chenf/opensource/mlc-llm/cpp/llm_chat.cc:1541
4: mlc::llm::LLMChat::Reload(tvm::runtime::TVMArgValue, tvm::runtime::String, tvm::runtime::String)
at /mnt/ssd/chenf/opensource/mlc-llm/cpp/llm_chat.cc:557
3: Init
at /mnt/ssd/chenf/opensource/mlc-llm/cpp/llm_chat.cc:160
2: tvm::runtime::relax_vm::VirtualMachineImpl::_Init(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
at /mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc:839
1: tvm::runtime::relax_vm::VirtualMachineImpl::Init(std::vector<DLDevice, std::allocator<DLDevice> > const&, std::vector<tvm::runtime::memory::AllocatorType, std::allocator<tvm::runtime::memory::AllocatorType> > const&)
at /mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc:462
0: tvm::runtime::relax_vm::VirtualMachineImpl::InitFuncPool()
at /mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc:676
File "/mnt/ssd/chenf/opensource/tvm/src/runtime/relax_vm/vm.cc", line 676
InternalError: Check failed: (func.defined()) is false: Error: Cannot find PackedFunc paged_kv_cache.attention_kernel_prefill in either Relax VM kernel library, or in TVM runtime PackedFunc registry, or in global Relax functions of the VM executable
Thanks @SunCrazy for sharing the stack trace. Figured out the reason and am gonna work on a fix. Will report back after fixing.
Potential fix: #1555
Still crashes.
The ROCm backend is broken, and I'm able to confirm it with the single-line command below:

mlc_chat compile llama2_7b --opt O0 --device rocm:0 --output /tmp/lib.so --quantization q4f16_1

I submitted one of the fixes, which handles MoE operators (group GEMM) properly in our LLVM-ROCm backend: apache/tvm#16403. Note it is still broken with failed LLVM verification, and we need a few extra changes to get it back:
Hi, I got the same error with the Vulkan backend.
Right. Vulkan/AMD share the same LLVM layer, and thus they are broken at the same time.
Confirmed that this PR fixes the ROCm issue, while @spectrometerHBH is working on a fix for Vulkan right now. I'm cherry-picking this change to mlc-ai/relax so that it goes into our nightlies starting tonight.
Starting with the next nightly (tomorrow), the following command should be recovered on ROCm: mlc_chat chat HF://junrushao/Llama-2-7b-chat-hf-q4f16_1
Thanks. I will be looking forward to the Vulkan update.
Hi, some Vulkan fixes are here: apache/tvm#16405. Let me know if it works for you.
@zhongzhenluo The Vulkan target is fixed by: mlc-ai/relax@9c8cf08
Hi, I get the same error, but when loading
But there is a strange thing: I load
This is a bug introduced by recent changes. We updated the wheel yesterday to incorporate the fix. Could you re-install the latest wheel and try again? Thanks!
Hi @junrushao, I'm having the same issue with
Can you tell me how to install the latest wheel?
Does the command below work:

mlc_chat chat HF://junrushao/Llama-2-7b-chat-hf-q4f16_1
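(If it is easier to test from Python, a rough equivalent is sketched below; it assumes this mlc_chat version accepts HF:// model identifiers in ChatModule, which is what the CLI command above relies on.)

# Rough Python equivalent of the CLI command above.
# Assumes ChatModule accepts HF:// model identifiers in this mlc_chat version.
from mlc_chat import ChatModule

cm = ChatModule("HF://junrushao/Llama-2-7b-chat-hf-q4f16_1")
print(cm.generate("What is the meaning of life?"))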
After I updated the wheel, this is now the error:
I installed it
With the command above, I got this error message:

[2024-01-19 07:37:43] ERROR model_metadata.py:51: FAILED: Encountered dynamic shape vocab_size, need to specify `--mlc-chat-config` for memory usage calculation.
Traceback (most recent call last):
File "/opt/conda/envs/py_3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/py_3.9/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/mlc_chat/cli/model_metadata.py", line 152, in <module>
main()
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/mlc_chat/cli/model_metadata.py", line 146, in main
_report_memory_usage(metadata, cfg)
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/mlc_chat/cli/model_metadata.py", line 77, in _report_memory_usage
param_shape = _read_dynamic_shape(param["shape"], config)
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/mlc_chat/cli/model_metadata.py", line 57, in _read_dynamic_shape
raise AttributeError
AttributeError

With my Python code, I got:

[2024-01-19 07:41:07] ERROR model_metadata.py:134: FAILED to read metadata section in legacy model lib.
Traceback (most recent call last):
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/mlc_chat/cli/model_metadata.py", line 132, in main
metadata = _extract_metadata(parsed.model_lib)
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/mlc_chat/cli/model_metadata.py", line 24, in _extract_metadata
return json.loads(VirtualMachine(load_module(model_lib), device("cpu"))["_metadata"]())
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/tvm/runtime/relax_vm.py", line 136, in __getitem__
return self.module[key]
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/tvm/runtime/module.py", line 192, in __getitem__
return self.get_function(name)
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/tvm/runtime/module.py", line 176, in get_function
raise AttributeError(f"Module has no function '{name}'")
AttributeError: Module has no function '_metadata'
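(For reference, the failing metadata check above can be reproduced on its own with a few lines of Python; this is only a sketch assembled from the _extract_metadata call shown in the traceback, with the library path as a placeholder.)

import json
from tvm.runtime import device, load_module
from tvm.runtime.relax_vm import VirtualMachine

# Mirror model_metadata.py's _extract_metadata: load the compiled lib and call its
# "_metadata" function on CPU; legacy libs fail with "Module has no function '_metadata'".
lib = load_module("/path/to/Llama-2-7b-chat-hf-q4f16_1-cuda.so")  # placeholder path
vm = VirtualMachine(lib, device("cpu"))
print(json.loads(vm["_metadata"]()))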
I got this (facing the same problem) using WSL + CUDA:
OUTPUT:
It managed to print out the reply.
I think I'm still having the same error when running this script.
I got the same bug. Do you know how to fix it?
@junrushao @tqchen I'm getting this same error with:

from mlc_chat import ChatModule

cm = ChatModule(model='/data/models/mlc//slim/Llama-2-7b-chat-hf-q4f16_1', module_lib_path='/data/models/mlc/slim/Llama-2-7b-chat-hf-q4f16_1/Llama-2-7b-chat-hf-q4f16_1-cuda.so')
tvm.error.InternalError: Traceback (most recent call last):
[bt] (6) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(TVMFuncCall+0x68) [0xffff7c6dac08]
[bt] (5) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x34) [0xffff7c7ac004]
[bt] (4) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::VirtualMachineImpl::_Init(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x1a8) [0xffff7c7a6698]
[bt] (3) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::VirtualMachineImpl::InitFuncPool()+0x5b4) [0xffff7c7a5db4]
[bt] (2) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(+0x30bf4e8) [0xffff7c79f4e8]
[bt] (1) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::detail::LogFatal::Entry::Finalize()+0x68) [0xffff7a8cce78]
[bt] (0) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::Backtrace[abi:cxx11]()+0x30) [0xffff7c722820]
File "/opt/mlc-llm/3rdparty/tvm/src/runtime/relax_vm/vm.cc", line 676
InternalError: Check failed: (func.defined()) is false: Error: Cannot find PackedFunc paged_kv_cache.attention_kernel_prefill in either Relax VM kernel library, or in TVM runtime PackedFunc registry, or in global Relax functions of the VM executable
Hi @dusty-nv, I believe updating both TVM and mlc_chat prebuilt packages and then recompiling
@MingkangW @sherelynyap The
Hi @CharlieFRuan, I've rebuilt from scratch against commit 006e138 (and confirmed this change is present in the installed version of mlc_chat), and recompiled the model with the latest, and yet the same issue persists. Any ideas?
@dusty-nv Are you using the prebuilt TVM relax or building TVM from source? In either case, an update to TVM would be required as well! For prebuilt, simply updating the pip package should do it; for building from source, pull head,
@CharlieFRuan I am building from source on ARM64+CUDA (using this Dockerfile, which re-clones the entire MLC/TVM source tree for each build, and hasn't given me prior issues with being out-of-date). 006e138 is the commit which should have fixed this bug, right? I double-checked that the code modification is present in the installed packages that I built. Although, actually, you say that an updated TVM is required - however, the TVM Relax submodule reference in the MLC-LLM tree hasn't been updated in 5 days (mlc-ai/relax@2921370). I will try manually checking out the latest mlc-ai/relax@484d425 instead.
@CharlieFRuan Rebuilt again with mlc-ai/relax@484d425 and still getting the same error...
Ahh, I can reproduce this now. I've been using https://github.com/apache/tvm. Taking a look into https://github.com/mlc-ai/relax. Apologies for the inconvenience. EDIT: actually TVM main gives the same error now, I must've been doing something different; looking into it.
I'm thinking about a better solution. Meanwhile, you can add these to the
Where you can substitute
Thanks @CharlieFRuan, enabling libflashinfer worked 👍
@sherelynyap For 2: the issue here is Vulkan-specific and is illustrated in this line of your log:
This is fixed in #1725, and I checked that this change is already in the newest mlc_chat_nightly.
Hi @dusty-nv @CharlieFRuan, I'm hitting the same error as you. Because my environment is CUDA 11.4 + ARM on Orin, I set the following option:

set(USE_FLASHINFER ON)

but got an error. Is there a way to solve this problem without setting flashinfer on?
Hey @jpf888, thanks for reporting. May I ask which GPU you are using?
Another question: when I compile and select q4f16, the following error is reported:
My model has hidden_size=3200 and num_attention_heads=32, so the head size is 100, and this error is reported. When I use q4f32, it compiles successfully. The reason may be that with q4f16, create_tir_paged_kv_cache is used, and head size 100 is not supported by TIRPagedKVCache's _attention_decode? I hit the same error in TensorRT-LLM: kernel head size 100 is not supported, but 112 is OK.
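For reference, the head-size arithmetic from the comment above as a minimal sketch (the values are the ones quoted there; which head sizes the TIR paged KV cache kernels actually accept is not asserted here):

hidden_size = 3200           # value quoted in the comment above
num_attention_heads = 32     # value quoted in the comment above
head_dim = hidden_size // num_attention_heads
print(head_dim)              # 100, the head size that triggers the error under q4f16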
@MasterJH5574 Hi, can you give me some ideas? Thanks!
Hi @jpf888, I have not been able to compile MLC/TVM with CUDA 11.4 / JetPack 5 since around 51fb0f4; for newer builds I am using CUDA 12.2 / JetPack 6.
Thank you very much, I got it.
Hello! When I run the compilation, I get these errors:

[18:59:57] D:\a\package\package\tvm\src\tir\schedule./concrete_schedule.h:287: ValueError: The block no longer exists in the IRModule
(the line above repeats several more times at 18:59:57 and 18:59:58)
[18:59:58] D:\a\package\package\tvm\src\tir\ir\stmt.cc:122: InternalError: Check failed: (e.dtype().bits() <= loop_var.dtype().bits()) is false: Loop variable's dtype (int32) is narrower than that of
(the line above repeats several more times; the message is truncated each time)
[2024-02-23 18:59:58] INFO pipeline.py:42: Lowering to VM bytecode
@sjtu-scx This issue is supposed to have been addressed by apache/tvm#16554 and #1725. Could you update to the latest TVM and MLC and try it again?
Gonna close this due to inactivity. Please open a new issue if there are any other problems.
🐛 Bug
When I execute Llama-2-7b-chat-hf-q4f16_1-MLC, some errors are raised as follows:
To Reproduce
Steps to reproduce the behavior:
/mnt/ssd/chenf/opensource/mlc-llm/build/mlc_chat_cli --model-lib-path prebuilt_libs/Llama-2-7b-chat-hf-q4f16_1-cuda.so --model Llama-2-7b-chat-hf-q4f16_1-MLC/
Environment
How you installed MLC-LLM (conda, source): source
How you installed TVM-Unity (pip, source): source
TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):