
CUDA op getrows fails for long sequences #11189

Open — wants to merge 1 commit into base: master
Conversation

milot-mirdita

I have integrated the ProstT5 protein language model into Foldseek. Thanks a lot for the great library! I am upstreaming a few fixes for issues I found in ggml during the integration. I hope that it's okay to push the changes here and that they get synced to the main ggml repo at some point.

The T5 encoder has a square input pos tensor (llm_build_pos_bucket(cause = false)), which quickly exceeds the 65,535 grid-dimension limit (on most GPUs?) of the CUDA GET_ROWS op.

lctx.inp_pos_bucket = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_tokens, n_tokens);

I have implemented this only for the _float op, as I don't feel very confident in CUDA programming. I tested it for my specific use case against the reference implementation of my model, but I don't have models ready to test the quantized versions.

T5 embeddings have a square input pos tensor which quickly exceeds the 65k limit of getrows

Implemented only for _float, need other implementations
@github-actions bot added labels: Nvidia GPU (Issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning) — Jan 11, 2025
Comment on lines +122 to +123
static const int64_t MAX_GRID_Y = 65535;
for (int64_t startY = 0; startY < ne10; startY += MAX_GRID_Y) {
Collaborator
This code is incorrect. The grid y dimension uses 16 bits and ranges from 0 to 65535 (inclusive). So the correct stride would be 65536. With this code two threads per grid write to the same address (though this should result in identical results). The correct way to fix this would be to modify the CUDA kernel and have it iterate with a stride of 65536 over the y dimension. This will also avoid issues with the number of nodes in a CUDA graph varying depending on input parameters.
