Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CUDA: always create events for split buffers #10185

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

JohannesGaessler
Copy link
Collaborator

Fixes #10176 .

I think the correct way to fix it is to just create the events unconditionally. Regardless of how the data is split you always need the events on the currently active device for the other devices to wait on. You could maybe reduce the number of events by only initializing those that are actually needed but I don't think that would be worthwhile since for the vast majority of use cases all events are already being created and used anyways.

@JohannesGaessler JohannesGaessler added the Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level label Nov 5, 2024
@slaren
Copy link
Collaborator

slaren commented Nov 5, 2024

Qwen2.5-0.5B does not work with this change alone, it still crashes in the memcpy later:

CUDA error: invalid argument
  current device: 1, in function ggml_cuda_op_mul_mat at ggml/src/ggml-cuda.cu:1583
  cudaMemcpyPeerAsync( src1_ddq_i, id, src1_ddq_i_source, ctx.device, src1_ncols*src1_padded_col_size*q8_1_ts/q8_1_bs, stream)

@slaren
Copy link
Collaborator

slaren commented Nov 5, 2024

It would also be possible to prevent using a split buffer entirely if the matrix is too small by returning false in the supports_op check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: Speculative Decoding "Segmentation fault (core dumped)"
2 participants