-
Notifications
You must be signed in to change notification settings - Fork 103
WIP: sessiongate controller #3679
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: geoberle The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
49402f2 to
619a687
Compare
sessiongate is a controller that manages secure breakglass sessions * credential minting * authn/z for session access * mise integration for bearer token validation * istio authorizationpolicy integration for session access gating * HCP KAS proxy management Signed-off-by: Gerd Oberlechner <[email protected]>
619a687 to
c4367bf
Compare
| // This package imports things required by build scripts, to force `go mod` to see them as dependencies | ||
| package tools | ||
|
|
||
| import _ "k8s.io/code-generator" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should do this with bingo
|
|
||
| // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="spec is immutable" | ||
| type SessionSpec struct { | ||
| // ttl is the time-to-live duration for the session |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
calculated from what point? creation timestamp of the object? establishment of the session?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
creation of the session CR. it needs to be from a geneva action runtime perspective to be comprehendable by customers. i will document this better
| TTL metav1.Duration `json:"ttl"` | ||
|
|
||
| // managementCluster specifies the AKS management cluster | ||
| // +kubebuilder:validation:Required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these tags will apply if you have them one block higher - so:
// +kubebuilder:validation:Required
// managementCluster specifies the AKS management cluster
ManagementCluster ManagementCluster `json:"managementCluster"`
this also keeps them from showing up in your GoDoc. I prefer that style
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // ManagementCluster identifies an Azure management cluster | ||
| type ManagementCluster struct { | ||
| // resourceId is the Azure resource ID of the management cluster | ||
| // +kubebuilder:validation:Required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look into using CEL for immutable constraints - do you really want to allow mutations to this field after creation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh i misunderstood the CEL statement. i thought having it on spec is sufficiently guarding against modification. i will adapt
| // +kubebuilder:validation:Pattern=`^/subscriptions/[a-fA-F0-9-]+/resourceGroups/[^/]+/providers/[^/]+/[^/]+/[^/]+$` | ||
| ResourceID string `json:"resourceId"` | ||
|
|
||
| // namespace of the HostedControlPlane CR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this something we can figure out by knowing the HCP resource ID? all we need is the CS cluster ID, which is not something the user needs to know anyway
| } | ||
|
|
||
| type AccessLevel struct { | ||
| // group is the name of the access group |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is insufficient godoc for me to understand what this field does - what are valid values - how do they map into the cluster - etc
| } | ||
|
|
||
| type SessionStatus struct { | ||
| // INSERT ADDITIONAL STATUS FIELD - define observed state of cluster |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leftover from sample-controller :D
| // +optional | ||
| CredentialsSecretRef string `json:"credentialsSecretRef,omitempty"` | ||
|
|
||
| // backendKASURL is the Kubernetes API server URL for the backend cluster |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is this different from the endpoint?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
endpoint is the ingress from an SRE
backend is where the proxy sends data to, which is
- the public KAS url if the cluster is public
- something else (cluster internal port-forward endpoint) for private clusters
| BackendKASURL string `json:"backendKASURL,omitempty"` | ||
| } | ||
|
|
||
| func (s *Session) IsReady() bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: put these into some _conditions.go file for ease of reading types
|
|
||
| func (r conditionsImpl) findUnhappyDependent() *metav1.Condition { | ||
| // This only works if there are dependents. | ||
| if len(r.dependants) == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: dependant
|
|
||
| // conditionsImpl implements the helper methods for evaluating Conditions. | ||
| // +k8s:deepcopy-gen=false | ||
| type conditionsImpl struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't seem related to the v1alpha1 api (to be useful it must apply to many apis generally)- consider moving to some other package in your project
| "github.com/Azure/ARO-HCP/sessiongate/pkg/controller" | ||
| ) | ||
|
|
||
| // buildAuthorizationPolicyApplyConfig creates an ApplyConfiguration for Server-Side Apply. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: every ApplyConfiguration is for SSA - describe why & how, not what
| }, | ||
| } | ||
|
|
||
| // Build ApplyConfiguration using fluent builder pattern |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
low-value comment that should be removed
| } | ||
|
|
||
| // Build ApplyConfiguration using fluent builder pattern | ||
| //nolint:govet // copylocks: protobuf message contains a mutex internally; builder copies it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what protobuf message? even if true, duplicate it, don't copy locks
| return false, fmt.Errorf("failed to apply AuthorizationPolicy: %w", err) | ||
| } | ||
|
|
||
| // determine if a change was made |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you have the existing policy, instead of always sending the mutating call, why not diff the intended config and exit early if there's nothing to do?
| } | ||
|
|
||
| // if the policy exists, check if it is owned by the session | ||
| existingPolicy, err := c.authzPoliciesLister.AuthorizationPolicies(c.sessionNamespace).Get(policyName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is racy - nothing stops someone from changing ownership between this read and the mutating apply - why is ownership necessary? when will it not be the case? can you factor the logic such that you only work on correctly-owned policies in the frst place?
|
|
||
| package controlplane | ||
|
|
||
| // Condition types for Session resources |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is part of your API, put it in that module
| // control plane controller implementation for Session resources. | ||
| // it runs with leader election and handles Session reconciliation into | ||
| // - istio AuthorizationPolicies | ||
| // - session credentials secrets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
look at the validating admission policies used in the secret sync controller or the acr pull binding controller for how to secure this component from having root on the cluster
|
|
||
| utilruntime.Must(sessiongateschema.AddToScheme(scheme.Scheme)) | ||
|
|
||
| ratelimiter := workqueue.NewTypedMaxOfRateLimiter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure where this came from necessarily, consider the defaults (seen in example controller, etc)
queue: workqueue.NewTypedRateLimitingQueueWithConfig(
workqueue.DefaultTypedControllerRateLimiter[gitBranchKey](),
workqueue.TypedRateLimitingQueueConfig[gitBranchKey]{
Name: GitControllerName,
},
),
(or justify variation from the default)
| return nil, fmt.Errorf("failed to add event handler for sessions (control plane): %w", err) | ||
| } | ||
|
|
||
| // Secret Informer for control plane |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when you add the VAP, you won't be able to do this anymore (that's ok)
| UpdateFunc: func(old, new interface{}) { | ||
| newPolicy := new.(*securityv1beta1.AuthorizationPolicy) | ||
| oldPolicy := old.(*securityv1beta1.AuthorizationPolicy) | ||
| if newPolicy.ResourceVersion == oldPolicy.ResourceVersion { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove
|
|
||
| newSecret := new.(*corev1.Secret) | ||
| oldSecret := old.(*corev1.Secret) | ||
| if newSecret.ResourceVersion == oldSecret.ResourceVersion { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove
| return fmt.Errorf("failed to wait for caches to sync") | ||
| } | ||
|
|
||
| sessions, err := c.sessionsLister.Sessions(c.sessionNamespace).List(labels.Everything()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? the informer should handle it
| return true | ||
| } | ||
|
|
||
| func (c *Controller) workCeremony(ctx context.Context, logger klog.Logger, objRef cache.ObjectName) (time.Duration, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"ceremony" is not a common term here
| } | ||
|
|
||
| sessionCopy := session.DeepCopy() | ||
| sessionCopy.InitializeConditions() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this smells - why do you need to do this?
|
|
||
| requeueAfter, err := c.syncHandler(ctx, logger, sessionCopy) | ||
|
|
||
| if patchErr := c.patchSessionStatus(ctx, logger, sessionCopy); patchErr != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why patch instead of apply?
why unconditionally?
| } else { | ||
| expiresAt = session.Status.ExpiresAt | ||
| } | ||
| session.Status.ExpiresAt = expiresAt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
every pass of your reconciliation function should take one mutating action, period - consider that your controller may be preempted or crash at any point. any in-memory state you store here that cannot be trivially reconstructed will be lost. write this update and return
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(strongly suggest the data-driven pattern I showed you in hypershift pki operator for modeling the one action your controller takes at the end of reconcile as a struct, then testing with cmp.diff. SSA applyconfigs are a godsend here)
| timeUntilExpiration := time.Until(expiresAt.Time) | ||
| if timeUntilExpiration <= 0 { | ||
| logger.Info("Session has expired, deleting", "session", session.Name, "expiresAt", expiresAt.Time) | ||
| session.MarkSessionInactive(sessiongatev1alpha1.ReasonExpired, "Session has expired") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the point of marking this if you do not update the object in the apiserver before deleting?
| } | ||
|
|
||
| // validation | ||
| if err := validateSession(session); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what can be validated in admission?
| // validation | ||
| if err := validateSession(session); err != nil { | ||
| session.StopProgressing(sessiongatev1alpha1.ReasonInvalidConfiguration, err.Error()) | ||
| logger.Error(err, "Session has invalid configuration") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you do not want to spam the operator's logs with user errors
| // Phase 2: authorization policy | ||
| // | ||
|
|
||
| changed, err := c.ensureAuthorizationPolicy(ctx, session) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- write down which authorization policy you will ensure exists,commit that to the session object
- on second reconcile, create the policy
remember you can crash after 1
|
|
||
| switch result.Status { | ||
| case controller.CredentialStatusHostedControlPlaneNotFound: | ||
| logger.V(2).Info("HostedControlPlane not found, will retry when it becomes available") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these things go into status - do not put into logs, just noise for the operator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(the person who needs to know that this happened is the end user, who looks at the object, not the person in Kusto)
| func validateSession(session *sessiongatev1alpha1.Session) error { | ||
| // validate TTL is positive | ||
| if session.Spec.TTL.Duration <= 0 { | ||
| return fmt.Errorf("spec.ttl must be a positive duration") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all of these which can be, should be CEL, so they run during admission and no session object is ever created with invalid fields - clients much prefer to see error on create rather than create, wait, error in status
| hcpResourceID.ResourceType.Namespace, hcpResourceID.ResourceType.Type) | ||
| } | ||
|
|
||
| return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quick read seems to show all of these can be CEL
| if session.Status.ExpiresAt != nil { | ||
| statusApply = statusApply.WithExpiresAt(*session.Status.ExpiresAt) | ||
| } | ||
| if session.Status.Endpoint != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just construct the literal applyconifg you want in the reconcile func, it is dangerous and unlikely to be correct to try to generically construct this later. notably - you cannot distinguish between "I want to set the endpoint to the empty string" from "I do not want to change the endpoint" here
sessiongate is a controller that manages secure breakglass sessions
it consists of a controlplane and a dataplane
the controlplane runs only in one controller pods (leader election) and owns the Session CRs, managing credential minting via CSR and CSRApproval CRs on the MCs and authn/z for session access via istio authorization policies and mise. once a session is ready, the cdataplane will notice via contitions on the Session CR. credentials for a session (private key and certificate) are stored in kubernetes secrets.
the dataplane runs on all controller pods and offers a webserver with endpoints for every active session. such endpoints act as proxy to the actual HCP on an MC. right now, this only works for public clusters. once we have private clusters, we need a tunnel solution of some sort (e.g. via port-forwarding or via a bastion)
sessions can expire and are deleted by the controlplane when that happens. the dataplane reacts to deletions (or sessions becoming
Ready: falsewith unregistration of the session from the webserver and closing all open connections (e.g. watches or websockets).the admin api will use the Session CR to craft timelimited and accesslimited sessions for SREs. it will hand out kubeconfigs to SREs that will not contain any credentials, just the URLs to the session endpoint on the dataplane and a kubelogin exec section that will provide an access token for the user. the istio access policy will validate that access token towards the session ownership and grant access only to the session owner. MISE plays a role as hardened JWT validation service before traffic even hits the session endpoint or the istio policy.
about the app id used in the kubelogin command
this is the one AKS uses. we will work on the creation of a dedicated app registration for our purpose, one that is single-tenant and lives in xME, allowing SREs with AME access to mint a token (keep in mind that the token alone will not grant access, as the session also needs to be created by principal mentioned in the token)
Example:
Resulting Kubeconfig
What
Why
Special notes for your reviewer