Changing Num_head of OPT-1.3b causes CUDA error: indexSelectLargeIndex, #751
Description
🐛 Bug
To Reproduce
My data processing works fine, but training breaks after some steps. I used colossalai for training, and the only thing I changed is NUM_Head in model-config.json.
Code sample
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [160,0,0], thread: [24,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [160,0,0], thread: [25,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [160,0,0], thread: [26,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [160,0,0], thread: [27,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [160,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [160,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [160,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [160,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
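For context, the assertion says that an index passed to an index-select style lookup (for example an embedding lookup) was not smaller than the size of the dimension being indexed. The log does not show which lookup failed, or whether the head count is involved. The following is a minimal sketch of two sanity checks that could narrow it down. The helper names `find_out_of_range` and `heads_divide_hidden` are hypothetical, and all numeric values are placeholders, not values taken from my run.

```python
from typing import Iterable, List, Tuple


def find_out_of_range(indices: Iterable[int], dim_size: int) -> List[Tuple[int, int]]:
    """Return (position, index) pairs that fall outside [0, dim_size)."""
    return [(pos, idx) for pos, idx in enumerate(indices) if not 0 <= idx < dim_size]


def heads_divide_hidden(hidden_size: int, num_heads: int) -> bool:
    """True when hidden_size splits evenly across num_heads attention heads."""
    return num_heads > 0 and hidden_size % num_heads == 0


# Placeholder values for illustration only; substitute the real ones.
table_size = 50272                       # assumed size of the lookup table
token_ids = [2, 100, 50271, 50272, 7]    # made-up batch of indices

# Any index >= table_size would trip `srcIndex < srcSelectDimSize` on the GPU.
print(find_out_of_range(token_ids, table_size))

# Assumed hidden size and head counts; checks that each head gets a whole slice.
print(heads_divide_hidden(2048, 32))
print(heads_divide_hidden(2048, 30))
```

Running the training script with the environment variable `CUDA_LAUNCH_BLOCKING=1`, or running a single batch on CPU, usually gives a Python traceback that points at the failing lookup instead of the asynchronous device-side assertion above.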
Expected behavior
Environment
- metaseq Version (e.g., 1.0 or master):
- PyTorch Version (e.g., 1.0):
- OS (e.g., Linux, Windows, MacOS):
- How you installed metaseq (pip, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information: