Skip to content

add failed state #132

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jun 17, 2024
Merged

add failed state #132

merged 2 commits into from
Jun 17, 2024

Conversation

davixcky
Copy link
Contributor

@davixcky davixcky commented Jun 17, 2024

Screenshot 2024-06-17 at 9 30 34 AM

@daniel-de-vera
Copy link
Contributor

@davixcky, this looks good to me, but I would extend the info we display in the other states as well.
I had this PR #130, but it looks like it won't get merged, so I would suggest adding the logic from below in this PR:

fmt.Fprintf(tw, "Job Name:\t%s\n", job.Name)
fmt.Fprintf(tw, "Job Runner Group:\t%s\n", job.Spec.RunnerGroup)
fmt.Fprintf(tw, "Status:\t%s\n", getJobStatus(job))
if state := job.Status.Attempts[0].State; state != nil {
switch {
case state.Queued != nil:
fmt.Fprintf(tw, "Message:\t%s\n", state.Queued.Message)
case state.Running != nil:
fmt.Fprintf(tw, "Runner Pod:\t%s/%s\n", state.Running.PodNamespace, state.Running.PodName)
case state.Canceled != nil:
fmt.Fprintf(tw, "Canceled By:\t%s\n", state.Canceled.CanceledBy)
fmt.Fprintf(tw, "Message:\t%s\n", state.Canceled.Message)
case state.Failed != nil:
if state.Failed.ExitCode != nil {
fmt.Fprintf(tw, "Exit Code:\t%d\n", *state.Failed.ExitCode)
}
fmt.Fprintf(tw, "Message:\t%s\n", state.Failed.Message)
}
}

WDYT?

@davixcky
Copy link
Contributor Author

@davixcky, this looks good to me, but I would extend the info we display in the other states as well. I had this PR #130, but it looks like it won't get merged, so I would suggest adding the logic from below in this PR:

fmt.Fprintf(tw, "Job Name:\t%s\n", job.Name)
fmt.Fprintf(tw, "Job Runner Group:\t%s\n", job.Spec.RunnerGroup)
fmt.Fprintf(tw, "Status:\t%s\n", getJobStatus(job))
if state := job.Status.Attempts[0].State; state != nil {
switch {
case state.Queued != nil:
fmt.Fprintf(tw, "Message:\t%s\n", state.Queued.Message)
case state.Running != nil:
fmt.Fprintf(tw, "Runner Pod:\t%s/%s\n", state.Running.PodNamespace, state.Running.PodName)
case state.Canceled != nil:
fmt.Fprintf(tw, "Canceled By:\t%s\n", state.Canceled.CanceledBy)
fmt.Fprintf(tw, "Message:\t%s\n", state.Canceled.Message)
case state.Failed != nil:
if state.Failed.ExitCode != nil {
fmt.Fprintf(tw, "Exit Code:\t%d\n", *state.Failed.ExitCode)
}
fmt.Fprintf(tw, "Message:\t%s\n", state.Failed.Message)
}
}

WDYT?

Makes sense, updated

daniel-de-vera
daniel-de-vera previously approved these changes Jun 17, 2024
Copy link
Contributor

@daniel-de-vera daniel-de-vera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@daniel-de-vera daniel-de-vera dismissed their stale review June 17, 2024 14:48

It seems there is a bug in the API, hold on merging this changes

@daniel-de-vera
Copy link
Contributor

This doesn't look good:

$ go run ./cmd/signadot/ job get basic-ipmqgdi
Job Name:           basic-ipmqgdi
Job Runner Group:   basic
Status:             Running
Runner Pod:         26/10        <============
Environment:        baseline
Created At:         about a minute ago
Started At:         about a minute ago
Dashboard URL:      https://app.signadot.com/testing/jobs/basic-ipmqgdi

Artifacts
No artifacts

I checked the API is returning invalid info:

$ sdcurl -e dev /api/v2/orgs/ddvcorp/jobs/basic-wx31lgq
GET http://localhost:8080/api/v2/orgs/ddvcorp/jobs/basic-wx31lgq

200 OK in 7.263833ms

{
  "name": "basic-wx31lgq",
  "createdAt": "2024-06-17T14:45:14Z",
  "deletedAt": "",
  "spec": {
    "namePrefix": "basic",
    "runnerGroup": "basic",
    "script": "#!/bin/bash\n\necho \"Hello this is the demo meeting\"\nstart=`date +%s`\n\nx=1\nwhile [ $x -le 30 ]\ndo\n  echo \"Welcome $x times (env TEST=$TEST)\"\n  x=$(( $x + 1 ))\n  echo \"Error $x: retrying\" 1\u003e\u00262\n  sleep 1\ndone\necho \"This is an artifact\" \u003e aaa.txt\necho \"We are done!\"\n\nend=`date +%s`\nruntime=$((end-start))\necho $runtime\n",
    "routingContext": null,
    "uploadArtifact": [
      {
        "path": "aaa.txt",
        "meta": {
          "format": "text"
        }
      }
    ]
  },
  "status": {
    "attempts": [
      {
        "id": 0,
        "createdAt": "2024-06-17T14:45:14Z",
        "sentToClusterBy": "2024-06-17T14:45:16Z",
        "startedAt": "2024-06-17T14:45:16Z",
        "phase": "running",
        "state": {
          "running": {
            "jobExecutorAddr": "10.244.26.50:3773",
            "podName": "10",             <========================
            "podNamespace": "26"         <========================
          }
        },
        "executionCount": 1,
        "tries": [
          {
            "seqNumber": 34
          }
        ]
      }
    ],
    "seqNum": 34
  }
}

Let's check what's going on before merging this one (maybe those fields are not intended to be there)

@daniel-de-vera
Copy link
Contributor

The above issue is related to https://github.com/signadot/signadot/pull/4574.
We should discuss how to approach this.

@scott-cotton
Copy link
Member

(maybe those fields are not intended to be there)

They are intended to be there, so that a user can access to the pod via kubectl.

I missed the interaction with the DNS caching fix though.

@daniel-de-vera
Copy link
Contributor

(maybe those fields are not intended to be there)

They are intended to be there, so that a user can access to the pod via kubectl.

I missed the interaction with the DNS caching fix though.

I guess the right thing to do here would be to report that in the job CRD, wdty @scott-cotton?

@scott-cotton
Copy link
Member

(maybe those fields are not intended to be there)

They are intended to be there, so that a user can access to the pod via kubectl.
I missed the interaction with the DNS caching fix though.

I guess the right thing to do here would be to report that in the job CRD, wdty @scott-cotton?

Feels less accessible if not available via the saas resource.

@daniel-de-vera
Copy link
Contributor

Feels less accessible if not available via the saas resource.

I'm proposing to keep it in the SaaS resource, but instead of parsing that info out of the JobExecutionAddr from job CRD, read it from explicit fields.

@scott-cotton
Copy link
Member

Feels less accessible if not available via the saas resource.

I'm proposing to keep it in the SaaS resource, but instead of parsing that info out of the JobExecutionAddr from job CRD, read it from explicit fields.

yes, that sounds right to me.

@daniel-de-vera
Copy link
Contributor

yes, that sounds right to me.

I can make this change, after the planning meeting.

@scott-cotton
Copy link
Member

yes, that sounds right to me.

I can make this change, after the planning meeting.

(but there is also the question of operator compatibility)

@daniel-de-vera
Copy link
Contributor

yes, that sounds right to me.

I can make this change, after the planning meeting.

(but there is also the question of operator compatibility)

Here is a PR that fixes the issue: https://github.com/signadot/signadot/pull/4575
No changes in the operator were required.

Copy link
Contributor

@daniel-de-vera daniel-de-vera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being fixed the API issue, LGTM

@davixcky davixcky merged commit 3cb2119 into main Jun 17, 2024
@davixcky davixcky deleted the add-failed-state branch June 17, 2024 17:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants