[Docs] add sample RayCluster using kube-rbac-proxy for dashboard access control #2578
How to test:
Apply manifest
$ kubectl apply -f ray-operator/config/samples/ray-cluster.auth.yaml
configmap/kube-rbac-proxy created
serviceaccount/kube-rbac-proxy created
clusterrolebinding.rbac.authorization.k8s.io/kube-rbac-proxy created
clusterrole.rbac.authorization.k8s.io/kube-rbac-proxy created
raycluster.ray.io/ray-cluster-with-auth created
Check cluster
$ kubectl get po
NAME                                              READY   STATUS    RESTARTS   AGE
ray-cluster-with-auth-head-bfmj6                  2/2     Running   0          30s
ray-cluster-with-auth-worker-group-worker-8msfl   1/1     Running   0          30s
ray-cluster-with-auth-worker-group-worker-969zs   1/1     Running   0          30s
ray-cluster-with-auth-worker-group-worker-bb7z5   1/1     Running   0          30s
ray-cluster-with-auth-worker-group-worker-hnvrw   1/1     Running   0          30s
Create session:
$ kubectl ray session ray-cluster-with-auth &
Ray Dashboard: http://localhost:8265
Ray Interactive Client: http://localhost:10001
Forwarding from 127.0.0.1:8265 -> 8265
Forwarding from [::1]:8265 -> 8265
Forwarding from 127.0.0.1:10001 -> 10001
Forwarding from [::1]:10001 -> 10001
Submit a job, expect 401
Use a dummy token with no access, expect 403
$ kubectl create serviceaccount dummy
serviceaccount/dummy created
$ export RAY_JOB_HEADERS="{\"Authorization\": \"Bearer $(kubectl create token dummy)\"}"
$ ray job submit --working-dir . -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
Handling connection for 8265
Handling connection for 8265
Job submission server address: http://localhost:8265
Handling connection for 8265
2024-11-26 23:11:54,039 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_7cdffea633c32bf1.zip.
2024-11-26 23:11:54,041 INFO packaging.py:530 -- Creating a file package for local directory '.'.
Handling connection for 8265
Traceback (most recent call last):
File "/usr/local/google/home/andrewsy/code/python/.env/bin/ray", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/scripts/scripts.py", line 2612, in main
return cli()
^^^^^
File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/dashboard/modules/job/cli_utils.py", line 54, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/dashboard/modules/job/cli.py", line 273, in submit
job_id = client.submit_job(
^^^^^^^^^^^^^^^^^^
File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/dashboard/modules/job/sdk.py", line 214, in submit_job
self._upload_working_dir_if_needed(runtime_env)
File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 398, in _upload_working_dir_if_needed
upload_working_dir_if_needed(runtime_env, upload_fn=_upload_fn)
File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/_private/runtime_env/working_dir.py", line 98, in upload_working_dir_if_needed
upload_fn(working_dir, excludes=excludes)
File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 391, in _upload_fn
self._upload_package_if_needed(
File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 377, in _upload_package_if_needed
self._upload_package(
File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 358, in _upload_package
self._raise_error(r)
File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 283, in _raise_error
raise RuntimeError(
RuntimeError: Request failed with status code 403: Forbidden (user=system:serviceaccount:default:dummy, verb=update, resource=, subresource=)
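The header export and the 403 message above can also be handled from Python. The following is a minimal sketch, not part of the PR: `build_job_headers` and `parse_denial` are hypothetical helper names, and the regular expression assumes the proxy's error text always has the shape shown in the traceback above.

```python
import json
import re
from typing import Optional


def build_job_headers(token: str) -> str:
    """Return a JSON string suitable for the RAY_JOB_HEADERS variable.

    json.dumps handles the quoting that the shell example escapes by hand.
    """
    return json.dumps({"Authorization": f"Bearer {token.strip()}"})


# Assumed message shape, copied from the RuntimeError in the transcript:
# "... status code 403: Forbidden (user=..., verb=..., resource=, subresource=)"
_DENIED = re.compile(
    r"status code (?P<status>\d{3}): \w+ "
    r"\(user=(?P<user>[^,]*), verb=(?P<verb>[^,]*), "
    r"resource=(?P<resource>[^,]*), subresource=(?P<subresource>[^)]*)\)"
)


def parse_denial(message: str) -> Optional[dict]:
    """Extract the status code and the attributes the proxy reported.

    Returns None when the message does not match the assumed shape.
    """
    match = _DENIED.search(message)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    return fields
```

For the dummy service account, `parse_denial` would report the user `system:serviceaccount:default:dummy` and the verb `update`. Since the traceback ends in `_upload_package`, the denied request appears to be the working-directory upload rather than the job submission itself.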
Use my personal access token with admin access:

spec:
  containers:
  - name: ray-worker
    image: rayproject/ray:2.34.0
How about using Ray 2.39.0 instead, in case Ray 2.34.0 has an issue with the dashboard agent?
done
containers:
- name: ray-worker
  image: rayproject/ray:2.34.0
  resources:
Use fewer resources to ensure users can follow the example in their environments.
done
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
Can a user send a request directly to port 8443?
Yes, you can still send requests to 8443. You would have to use NetworkPolicy or similar to ensure only port 8265 is used.
I tried to bind all processes in the Head container to 127.0.0.1, but I think that broke some communication with the Raylets.
Oh nvm, I made a mistake in my testing. dashboard-host: 127.0.0.1 actually works, so users cannot directly access port 8443. The only way would be exec'ing into the container or port-forwarding.
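The point of this thread is that the dashboard is only shielded by the proxy when it listens on a loopback address. A small sketch of that check follows; the function name is hypothetical, and treating the literal `localhost` as loopback is an assumption about name resolution inside the container.

```python
import ipaddress


def dashboard_bound_to_loopback(dashboard_host: str) -> bool:
    """Return True if the given dashboard-host value is a loopback address.

    A loopback bind means only processes in the same pod network namespace
    (such as a sidecar proxy) can reach the dashboard port directly.
    """
    if dashboard_host == "localhost":
        # Assumption: "localhost" resolves to a loopback address.
        return True
    try:
        return ipaddress.ip_address(dashboard_host).is_loopback
    except ValueError:
        # Not an IP literal; treat it as not provably loopback.
        return False
```

With the values discussed above, `'0.0.0.0'` fails the check and `127.0.0.1` passes it.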
    memory: "4Gi"
readinessProbe:
  httpGet:
    path: "/api/gcs_healthz"
This may cause issues when the raylet or dashboard agent crashes.
okay, I'll update to use exec probes, but we should revisit #2360
513be21 to c7d26bd
Signed-off-by: Andrew Sy Kim <[email protected]>
c7d26bd to 63479be
I guess KubeRay also requires some updates, for example to the HTTP requests that the KubeRay operator sends to the Ray dashboard for RayJob/RayService. Are these follow-up PRs?
Would you mind opening an umbrella issue to track progress on KubeRay K8s-native auth as a whole? Thanks!
template:
  metadata:
  spec:
    serviceAccountName: kube-rbac-proxy
Note that when the autoscaler is enabled, KubeRay will automatically reconcile RBAC resources if serviceAccountName is not specified.
The current configuration may cause the autoscaler to not work. It's fine for this YAML, but it is worth mentioning in the documentation.
Good point, for the autoscaler case, we'll need to add additional resources in the autoscaler ClusterRole.
Yes, these are things we'll need to automate eventually for RayJob / RayService. But I think authentication for RayJob / RayService is not a high priority because users typically don't use them interactively like RayCluster. Authentication is more important for long-lived RayClusters that are used interactively. In this scenario I don't think we need to make any changes in KubeRay, except for eventually automating the installation of the sidecar.
command:
- bash
- -c
- wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep
open an issue to avoid users configuring the probes
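The probe threads above come down to one idea: a single HTTP check of `/api/gcs_healthz` can miss a crashed raylet or dashboard agent, while an exec probe can chain several checks and fail if any of them fails. The sketch below illustrates that chaining under stated assumptions. The function names are hypothetical, and because the quoted command is cut off after `| grep`, the string it matches is not known here, so `marker` is a required argument rather than a guessed default.

```python
import urllib.request
from typing import Callable, Iterable


def http_fetch(url: str, timeout: float = 2.0) -> str:
    """Fetch a URL and return its body as text (2 s timeout, like `wget -T 2`)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def all_endpoints_healthy(
    urls: Iterable[str],
    marker: str,
    fetch: Callable[[str], str] = http_fetch,
) -> bool:
    """Return True only if every endpoint responds and its body contains marker.

    Any connection error, timeout, or HTTP error counts as unhealthy, which
    mirrors a chain of `wget ... | grep ...` commands joined with `&&`.
    """
    for url in urls:
        try:
            body = fetch(url)
        except OSError:  # URLError, HTTPError, and socket timeouts are OSErrors
            return False
        if marker not in body:
            return False
    return True
```

A caller would pass the raylet health URL from the command above together with the GCS health URL for the cluster, and exit non-zero when the function returns False.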
Why are these changes needed?
This adds an example RayCluster using https://github.com/brancz/kube-rbac-proxy for access control to the Ray dashboard.
There will be a follow-up PR to reference this example in a guide / tutorial.
Related issue number
Checks