@sahil9001 sahil9001 commented Sep 19, 2025

Summary

Fix intermittent SeaweedFS S3 auth race causing artifact upload failures in CI (“Signed request requires setting up SeaweedFS S3 authentication”).

Root Cause

Argo sent signed S3 upload requests before SeaweedFS had finished configuring its S3 users. Auth setup previously ran in a separate Job, which left a timing window between the SeaweedFS pod starting to serve and the credentials being applied.

Changes

  • SeaweedFS deployment:
    • Added envFrom to read accesskey/secretkey from mlpipeline-minio-artifact.
    • Added lifecycle.postStart to:
      • Wait for SeaweedFS readiness.
      • Create mlpipeline bucket (idempotent).
      • Run s3.configure -user kubeflow-admin -access_key $accesskey -secret_key $secretkey -actions Admin -apply.
  • Removed the separate init-seaweedfs Job (seaweedfs-create-admin-user-job.yaml).
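
The hook steps listed above could be sketched roughly as the following shell functions. This is an illustration under assumptions, not the script from this PR: `retry`, `port_open`, `weed_shell`, and `bootstrap_s3` are hypothetical helper names, the readiness probe (`nc -z` against port 8333, the default SeaweedFS S3 port) is only a guess at what "wait for readiness" means here, and checking `s3.bucket.list` before `s3.bucket.create` is one possible way to keep bucket creation idempotent. Only the `s3.configure` invocation is taken from the description.

```shell
#!/bin/sh
# Sketch only. Assumes $accesskey and $secretkey are injected via envFrom,
# and that nc and the weed binary are available in the container image.

# retry MAX DELAY CMD...: run CMD until it succeeds, at most MAX attempts.
retry() {
  max=$1; delay=$2; shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Hypothetical readiness check: is the S3 port accepting connections?
port_open() {
  nc -z localhost "${S3_PORT:-8333}"
}

# Feed a single command to the SeaweedFS admin shell on stdin.
weed_shell() {
  echo "$1" | "${WEED:-weed}" shell
}

bootstrap_s3() {
  # 1. Wait (bounded) for SeaweedFS to accept connections.
  retry 30 2 port_open || return 1
  # 2. Create the bucket only if it is not listed yet.
  if ! weed_shell "s3.bucket.list" | grep -qw mlpipeline; then
    weed_shell "s3.bucket.create -name mlpipeline" || return 1
  fi
  # 3. Apply the admin identity using the keys from the secret.
  weed_shell "s3.configure -user kubeflow-admin -access_key $accesskey -secret_key $secretkey -actions Admin -apply"
}
```

In a hook, `bootstrap_s3` would be the last call of the script. Bounding the wait keeps a slow start from blocking the hook indefinitely, at the cost of a hook failure if the limit is too tight.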

Impact

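  • Removes the timing window between the separate Job and the pod: the container is not reported as Running until postStart completes, so S3 auth is configured before the SeaweedFS pod becomes Ready.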
  • Stabilizes CI artifact uploads with signed requests.
  • Keeps authenticated access (no anonymous fallback).

Testing

  • Manifests lint clean.
  • postStart runs inside the SeaweedFS container ensuring ordering; bucket creation is idempotent.

Rollout

  • No config changes required; uses existing mlpipeline-minio-artifact secret.
  • Deploy manifests; SeaweedFS pod will self-configure on startup.

Risks / Mitigations

  • Risk: a postStart failure kills the container, which is then restarted per its restart policy, so readiness is delayed.
  • Mitigation: readiness wait, idempotent bucket creation, single configure apply, explicit keys from the secret.

Checklist:

Hi @sahil9001. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

🚫 This command cannot be processed. Only organization members or owners can use the commands.

@juliusvonkohout
Member

@sahil9001 thank you for the PR. Can you fix the indentation and remove the spaces you added?

@akagami-harsh @pschoen-itsc for review

@sahil9001
Author

Thanks @juliusvonkohout. I have updated the code

@pschoen-itsc @akagami-harsh can you check?

@juliusvonkohout
Member

/ok-to-test
@HumairAK are the tests broken on the master branch?

@sahil9001
Author

@juliusvonkohout I see the tests have been failing on master for the last few commits, but they pass intermittently. Is that expected?

@pschoen-itsc
Contributor

pschoen-itsc commented Sep 24, 2025

seaweedfs-create-admin-user-job.yaml is still referenced in kustomization.yaml
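
A dangling reference like this one can be caught before review with a small check. The sketch below is a hypothetical helper, not part of the repository: `check_kustomization` is an invented name, and the parsing is deliberately crude (it treats every `- something.yaml` list item in `kustomization.yaml` as a local file reference, so remote resources, directories, and patches are not handled).

```shell
#!/bin/sh
# check_kustomization DIR: print list items in DIR/kustomization.yaml that end
# in .yaml but do not exist on disk. Returns 1 if any are missing, else 0.
check_kustomization() {
  dir=$1
  missing=0
  # Extract "- file.yaml" list items; comments and other keys are ignored.
  for f in $(sed -n 's/^[[:space:]]*-[[:space:]]*\([^[:space:]#]*\.yaml\)[[:space:]]*$/\1/p' "$dir/kustomization.yaml"); do
    if [ ! -e "$dir/$f" ]; then
      echo "dangling reference: $f"
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_kustomization <dir>` against the directory that holds the SeaweedFS manifests would list the deleted Job file as long as it is still referenced.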

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign juliusvonkohout for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

@pschoen-itsc pschoen-itsc left a comment

/lgtm

@VaniHaripriya
Contributor

@sahil9001 Could you please take a look at the intermittent PostStartHookError failures during the SeaweedFS deployment?

@sahil9001
Author

@sahil9001 Could you please take a look at the intermittent PostStartHookError failures during the SeaweedFS deployment.

Sure @VaniHaripriya, can you tell me how I can test this locally? Is there a command that replicates those tests so I can check correctness?

@VaniHaripriya
Contributor

@sahil9001 You can create a Kind cluster and run the following to mirror what CI does:

sh .github/resources/scripts/deploy-kfp.sh --storage seaweedfs

@sahil9001
Author

sahil9001 commented Sep 25, 2025

@VaniHaripriya I am getting this error when running the script:

.github/resources/scripts/helper-functions.sh: line 60: pip: command not found
.github/resources/scripts/helper-functions.sh: line 61: python: command not found
Deploy unsuccessful. Not all pods running.

I already have Python and pip installed on my system, but I still get this error.
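
One possible explanation, not confirmed anywhere in this thread, is that only `python3` and `pip3` are on `PATH` while the helper script calls the unversioned `python` and `pip` names. A quick way to see which names resolve is sketched below; `report_cmds` is a hypothetical helper written for this illustration.

```shell
#!/bin/sh
# report_cmds NAME...: print where each command name resolves, if at all.
report_cmds() {
  for name in "$@"; do
    if path=$(command -v "$name" 2>/dev/null); then
      echo "$name -> $path"
    else
      echo "$name -> not found"
    fi
  done
}

report_cmds python python3 pip pip3
```

If only the versioned names resolve, activating a virtual environment before running the deploy script is one way to make `python` and `pip` available without touching the system installation.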

@juliusvonkohout
Member

juliusvonkohout commented Sep 26, 2025

@sahil9001 you can also install it as shown here to replicate it locally https://github.com/kubeflow/manifests/blob/8cc8dcfcb749bf50b2d525c01f13ec91d14ea258/.github/workflows/full_kubeflow_integration_test.yaml#L70

You can even just raise a dummy PR against kubeflow/manifests to use the testing infrastructure.
