[SYCL] pass SYCL CI #10041

Open. Wants to merge 2 commits into base: master
Conversation

airMeng
Collaborator

@airMeng airMeng commented Oct 25, 2024

fix some issues in norm and permute GEMM

The cloud instance is unstable, so keep only the basic functionality in the SYCL CI to save machine time.

@airMeng
Collaborator Author

airMeng commented Oct 25, 2024

@ggerganov how do I trigger ggml-7-sycl?

@ggerganov
Owner

Add the string ggml-ci somewhere in the commit message. You can git commit --amend, add ggml-ci and then force push to trigger the CI.
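
As a rough sketch of that workflow, the script below builds a throwaway repository with a local bare remote, then amends the last commit to append ggml-ci and force pushes it. The temporary paths, the branch name sycl-ci, and the user identity are assumptions made up for illustration; in a real checkout only the amend and force-push lines would apply.

```shell
#!/bin/sh
set -eu

# Hypothetical sandbox: a scratch repo plus a local bare "remote",
# so the example can run without touching a real project.
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git init -q "$work/repo"
cd "$work/repo"
git config user.email "dev@example.com"
git config user.name "Dev"
git remote add origin "$work/remote.git"

# An ordinary commit whose message does not yet mention ggml-ci.
echo "change" > file.txt
git add file.txt
git commit -q -m "WA for permute(0,1,3,2) mul_mat"
git push -q origin HEAD:refs/heads/sycl-ci

# Amend the commit: keep the old message and append ggml-ci
# as a separate paragraph (each -m becomes its own paragraph).
msg=$(git log -1 --format=%B)
git commit -q --amend -m "$msg" -m "ggml-ci"

# The amended commit has a new hash, so the branch must be force pushed.
git push -q --force origin HEAD:refs/heads/sycl-ci

# Show the final commit message.
git log -1 --format=%B
```

On a shared branch, git push --force-with-lease is a more cautious alternative to --force, since it refuses to overwrite commits that someone else pushed in the meantime.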

WA for permute(0,1,3,2) mul_mat
ggml-ci
@github-actions github-actions bot added the labels ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) Oct 25, 2024
@airMeng
Collaborator Author

airMeng commented Oct 25, 2024

@arthw it is weird that test-quantize-fns, which is a pure host-side executable, passes under debug build but fails under debug model. Is it possible that it is an ICX/ICPX issue?

@github-actions github-actions bot added the testing (Everything test related) label Oct 25, 2024
tests/test-quantize-fns.cpp Outdated
@airMeng
Collaborator Author

airMeng commented Oct 25, 2024

@ggerganov only ggml-7-sycl was not triggered. Is it offline?

@ggerganov
Owner

Looks like the same connectivity issues as before:

# when trying to SSH to the instance
channel 0: open failed: connect failed: No route to host
stdio forwarding failed
Connection closed by UNKNOWN port 65535

The instance is shown as "Ready" in the Intel Cloud.

@airMeng
Collaborator Author

airMeng commented Oct 25, 2024

Looks like the same connectivity issues as before:

# when trying to SSH to the instance
channel 0: open failed: connect failed: No route to host
stdio forwarding failed
Connection closed by UNKNOWN port 65535

The instance is shown as "Ready" in the Intel Cloud.

Could you let us know whether this happens from time to time? Sorry for the inconvenience; ITDC is still in development.

@ggerganov
Owner

It seemed to happen quite often, usually about 10 minutes after I managed to SSH in. I've reported at least 3 occurrences to Intel support on 2 different machines, but no root cause was found. Yesterday it looked like the issue was gone because I was able to SSH in and set up the CI successfully without disconnects. But now it seems the issue is back. Note that I wasn't connected or doing anything on the machine. The last CI run was ~4 hours ago from the commits in this PR: https://github.com/ggml-org/ci/commits/results/. Will let you know if the connection comes back again.

@airMeng
Collaborator Author

airMeng commented Oct 25, 2024

It seems the machine hung, and now it is back.

@ggerganov could you share your commands, for example a script that runs the llama.cpp CI with one click, so that the cloud engineers can debug it, either here or by email? We think this should be a simple issue, but one we hadn't considered before.

@ggerganov
Owner

Yes, will send you an email in about half an hour

@github-actions github-actions bot added the devops (improvements to build systems and github actions) label Nov 6, 2024
@airMeng
Collaborator Author

airMeng commented Nov 6, 2024

@ggerganov the connection issue is solved; you might need to request another instance. Sorry for the inconvenience. You can use this PR to test the CI again.

@ggerganov
Owner

Thank you. Will try to set it up today. Will let you know.

@ggerganov
Owner

@airMeng I configured the new instance today, and just as I was finishing the deployment of ggml-ci I got disconnected again and can no longer use the instance:

channel 0: open failed: connect failed: No route to host
stdio forwarding failed
Connection closed by UNKNOWN port 65535
