Make kebechet respond to release tickets on all failures #629
Comments
We already mentioned in the release issue that the person trying to create the release is not a maintainer. |
@tumido are you good with this behavior? can we close this issue? |
@goern please don't close, I don't think we understand each other here 🙂 @saisankargochhayat yeah, that's true and that's precisely why I've excluded those cases in the description, see:
In our case the issue was hard to triage because Kebechet failed to push the tag, since it was already released outside of Kebechet via As you can see we've been very much in the dark about what's happening, and kebechet didn't tell us why it failed. This ticket is precisely for such occasions of anticipated failures, not about this exact failure type. My ask here is whether we can make kebechet report the status every time, in any failure case. |
From what I understand, in a scenario like this a comment on the release issue is what we want - |
Well, I don't think we should pay attention to this particular cause; this is not about fancy reporting on one specific, narrow reason for failure. This should be about bare, old-school reporting for any kind of failure. If the bot can provide any insight into what happened, it will mean the bot saved us from filing 3 more triage trial issues. A link to the job run, the failed steps, the log of the step, whatever: that's what I'd like to see, the "debug" data. |
So, as a general principle, for any exception we encounter we do put a comment on the issue informing the user. This seems like a corner case where the release was created manually instead of through kebechet, which messed things up. But feel free to let me know if you find anywhere else where reporting the error could be helpful to the user. Source code link: https://github.com/thoth-station/kebechet/blob/master/kebechet/managers/version/version.py Maybe it's a good idea to state in the version manager's documentation that whenever you release manually, you should ensure that the version string in the source code and the release tag both indicate the same version. |
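The consistency check suggested above could look roughly like the sketch below. It is not taken from kebechet's version manager; the function name `tag_matches_version` and the assumption that tags may carry a leading `v` are hypothetical and only serve as illustration.

```python
def tag_matches_version(tag: str, version: str) -> bool:
    """Return True if a git tag and a source version string name the same version.

    Assumption (not from kebechet): tags may be written with a leading "v",
    e.g. "v1.2.3", while the source code version string is plain "1.2.3".
    """
    tag = tag.strip()
    if tag.startswith("v"):
        tag = tag[1:]
    return tag == version.strip()


# Hypothetical values: a manual release tagged v1.2.4 while the source still says 1.2.3.
consistent = tag_matches_version("v1.2.3", "1.2.3")
mismatch = tag_matches_version("v1.2.4", "1.2.3")
```

Running such a check before pushing a tag would let the bot explain a mismatch on the release issue instead of failing without a message.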
I think more generic failure handling would be appreciated on the user side. Right now I'm debugging another issue in a different repo, where the release process failed silently (the release PR was opened and the git tag was pushed, yet no image was delivered to quay). There's no message from any of the bots on any of the issues (sesheta even closed the release issue as a success). See for yourself: aicoe-aiops/mailing-list-analysis-toolkit#24 https://quay.io/repository/aicoe/mailing-list-analysis-toolkit?tab=tags I wasn't able to locate the Tekton pipeline responsible for that release, so I've triggered the "Deliver container image" issue pipeline for the missing image: The build failed on some networking error (now I know that it was a networking issue, since I was able to locate the Tekton job): Yet the bots are still silent on the issues. This is not about a single corner case. This is about a generic safety measure: if I hit any error, I report it. Can we make AICoE-CI do that, please? cc @harshad16 |
@tumido thanks for pointing it out. On the aicoe-ci side, we are trying to get this message to the user, either on the PR or on the opened issue. Some changes are needed to get this to a state where it is more convenient for the user to get more information; we will try to surface these details for the user. On the topic of kebechet, a useful feature would be responding on the issue with the reason it went stale. The reason is that the pod running kebechet has failed, but because it failed, no message is relayed all the way to the GitHub issue. We should plan on handling this, either by reporting the error traceback to the GitHub issue or PR (for which we would have to monitor the exceptions), or via a sidecar container that responds on the GitHub issue with the log of the failed main container. |
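The first option mentioned above (catching the exception and relaying the traceback to the issue) might be sketched as follows. This is only an illustration under assumptions: `run_and_report` and the `post_comment` callback are hypothetical names, not kebechet APIs, and the actual call to GitHub is deliberately left out.

```python
import traceback
from typing import Callable, List

FENCE = "`" * 3  # Markdown code fence, built here to keep the example readable


def run_and_report(task: Callable[[], None], post_comment: Callable[[str], None]) -> bool:
    """Run task; on any exception, pass a comment body with the traceback to post_comment.

    Returns True on success and False if the task raised.
    """
    try:
        task()
    except Exception:
        body = (
            "The run failed with the following error:\n\n"
            f"{FENCE}\n{traceback.format_exc()}{FENCE}"
        )
        post_comment(body)
        return False
    return True


# Hypothetical failing step; comments are collected in a list instead of being sent to GitHub.
def failing_release() -> None:
    raise RuntimeError("tag already exists")


comments: List[str] = []
ok = run_and_report(failing_release, comments.append)
```

A wrapper like this only covers failures raised inside the process; a pod that is killed from outside would still need something like the sidecar approach.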
@harshad16 I know it's hard to catch every possibility and I know aicoe-ci is doing its best and I'm totally rooting for you! Yet we're still pushing the limits and demands and opening new issues... 😄 The sidecar container or something sounds like a wonderful idea (it also sounds like a lot of work)! Looking forward to the bright future 👍 |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale |
/remove-lifecycle stale @harshad16 what is the status of this? |
So.. Since I always wanted to learn the CI ropes, I've experimented with this when building my own CI for the OperateFirst slack bot... I think a comment like from the bots would be enough: I'm updating the same comment in various stages of the CI with the most recent actions taken. It helps me understand which workflow got stuck and at which step. If it would be possible to have something like this for AICoE-CI, I think it would be a huge jump forward in usability. |
I'm all in for more chatops, as long as we keep it accessible to us Red Hats using Google Chat ;) Shall we send out events from the CI to a Kafka topic and have different consumers send messages to Slack or gchat? |
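A single status comment that is rewritten at each stage could be rendered along these lines. The layout, the state names, and `render_status` are assumptions made for illustration; the original comment format from the OperateFirst bot is not reproduced here.

```python
from typing import List, Tuple

# Hypothetical state markers; any set of states would work the same way.
MARKERS = {"done": "[x]", "running": "[~]", "failed": "[!]", "pending": "[ ]"}


def render_status(workflow: str, stages: List[Tuple[str, str]]) -> str:
    """Build the Markdown body of one status comment from (stage name, state) pairs."""
    lines = [f"Status of `{workflow}`:"]
    for name, state in stages:
        lines.append(f"- {MARKERS[state]} {name} ({state})")
    return "\n".join(lines)


# The CI would re-render and overwrite the same comment after every stage.
body = render_status("release", [("open release PR", "done"), ("push tag", "failed")])
```

Because the whole body is regenerated each time, the comment always shows the latest state and the step where the workflow stopped.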
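The fan-out idea (one stream of CI events, several chat consumers) can be sketched with an in-memory stand-in for the topic. This is not a Kafka client; `EventBus`, the topic name `ci-events`, and the event fields are hypothetical and only show the shape of the design.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], None]


class EventBus:
    """In-memory stand-in for a topic: every subscriber receives every published event."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        for handler in self._subscribers[topic]:
            handler(event)


# Two hypothetical consumers; real ones would call the Slack or Google Chat APIs.
slack_messages: List[str] = []
gchat_messages: List[str] = []

bus = EventBus()
bus.subscribe("ci-events", lambda e: slack_messages.append(f"{e['step']}: {e['status']}"))
bus.subscribe("ci-events", lambda e: gchat_messages.append(f"{e['step']}: {e['status']}"))
bus.publish("ci-events", {"step": "build", "status": "failed"})
```

The CI only publishes events; which chat systems receive them is decided by the consumers, so adding another channel would not require touching the pipeline.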
/priority backlog |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /lifecycle rotten |
/remove-lifecycle rotten |
@goern: Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /lifecycle rotten |
/remove-lifecycle rotten |
/assign goern |
/sig user-experience |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale |
/lifecycle frozen |
Is your feature request related to a problem? Please describe.
Releasing via kebechet is very convenient and straightforward when it works. When it doesn't, and the failure is not related to permissions (the user is not a maintainer and such), it's really hard to triage.
Describe the solution you'd like
Get an error message if any of the build/release steps fails.
Describe alternatives you've considered
n/a
Additional context
aicoe-aiops/categorical-encoding#16