This repository was archived by the owner on Nov 1, 2024. It is now read-only.

Changing Num_head of OPT-1.3b causes CUDA error: indexSelectLargeIndex #751

Open
@Gusicun

Description

🐛 Bug

To Reproduce

My data processing works fine, but training breaks after some steps. I used ColossalAI for training, and the only thing I changed is NUM_Head in model-config.json.
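When only the head count is edited, one quick sanity check is whether the hidden size still divides evenly across heads. The sketch below is a hypothetical helper, not part of metaseq or ColossalAI. It assumes the published OPT-1.3b hidden size of 2048; the new head count is not stated in this report, so the values are placeholders. It does not establish that the head count is the cause of the error.

```python
def check_head_config(hidden_size: int, num_heads: int) -> int:
    """Return the per-head dimension, or raise if the split is not even.

    Hypothetical helper for illustration; the argument names are assumptions
    and do not correspond to keys in model-config.json.
    """
    if num_heads <= 0:
        raise ValueError(f"num_heads must be positive, got {num_heads}")
    if hidden_size % num_heads != 0:
        raise ValueError(
            f"hidden_size={hidden_size} is not divisible by num_heads={num_heads}"
        )
    return hidden_size // num_heads


if __name__ == "__main__":
    # 2048 is the published OPT-1.3b hidden size; 32 is its default head count.
    print(check_head_config(2048, 32))
```

If this check passes for the edited config, the head count alone is unlikely to explain an out-of-range index, and the inputs to the embedding lookups are worth inspecting next.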

Code sample

../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [160,0,0], thread: [24,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [160,0,0], thread: [25,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [160,0,0], thread: [26,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [160,0,0], thread: [27,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [160,0,0], thread: [28,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [160,0,0], thread: [29,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [160,0,0], thread: [30,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [160,0,0], thread: [31,0,0] Assertion srcIndex < srcSelectDimSize failed.
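The assertion `srcIndex < srcSelectDimSize` means an index passed to an index-select style lookup was at or beyond the size of the dimension being indexed. A common source is an embedding lookup that receives a token id or position outside the table, though this report does not confirm that this is what happened here. The sketch below is a hypothetical, framework-free helper that scans a batch of ids on the host before it reaches the GPU; `find_out_of_range` and `table_size` are illustrative names, not metaseq or ColossalAI APIs.

```python
from typing import List, Sequence, Tuple


def find_out_of_range(
    batch: Sequence[Sequence[int]], table_size: int
) -> List[Tuple[int, int, int]]:
    """Return (row, column, index) for every index outside [0, table_size).

    `batch` is a nested sequence of integer ids (for example token ids);
    `table_size` is the number of rows in the lookup table being indexed.
    """
    bad = []
    for r, row in enumerate(batch):
        for c, idx in enumerate(row):
            if idx < 0 or idx >= table_size:
                bad.append((r, c, idx))
    return bad


if __name__ == "__main__":
    # Toy values: a table with 8 rows accepts ids 0..7, so 8 is flagged.
    print(find_out_of_range([[0, 3, 7], [2, 8, 1]], table_size=8))
```

Running the failing step with the environment variable `CUDA_LAUNCH_BLOCKING=1`, or on CPU, usually yields a Python traceback that points at the specific lookup instead of an asynchronous device-side assert.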

Expected behavior

Environment

  • metaseq Version (e.g., 1.0 or master):
  • PyTorch Version (e.g., 1.0)
  • OS (e.g., Linux, Windows, MacOS):
  • How you installed metaseq (pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

Metadata

    Labels

    bug: Something isn't working
