Skip to content

Conversation

@geoberle
Copy link
Collaborator

@geoberle geoberle commented Dec 22, 2025

sessiongate is a controller that manages secure breakglass sessions
it consists of a controlplane and a dataplane

the controlplane runs only in one controller pods (leader election) and owns the Session CRs, managing credential minting via CSR and CSRApproval CRs on the MCs and authn/z for session access via istio authorization policies and mise. once a session is ready, the cdataplane will notice via contitions on the Session CR. credentials for a session (private key and certificate) are stored in kubernetes secrets.

the dataplane runs on all controller pods and offers a webserver with endpoints for every active session. such endpoints act as proxy to the actual HCP on an MC. right now, this only works for public clusters. once we have private clusters, we need a tunnel solution of some sort (e.g. via port-forwarding or via a bastion)

sessions can expire and are deleted by the controlplane when that happens. the dataplane reacts to deletions (or sessions becoming Ready: false with unregistration of the session from the webserver and closing all open connections (e.g. watches or websockets).

the admin api will use the Session CR to craft timelimited and accesslimited sessions for SREs. it will hand out kubeconfigs to SREs that will not contain any credentials, just the URLs to the session endpoint on the dataplane and a kubelogin exec section that will provide an access token for the user. the istio access policy will validate that access token towards the session ownership and grant access only to the session owner. MISE plays a role as hardened JWT validation service before traffic even hits the session endpoint or the istio policy.

about the app id used in the kubelogin command
this is the one AKS uses. we will work on the creation of a dedicated app registration for our purpose, one that is single-tenant and lives in xME, allowing SREs with AME access to mint a token (keep in mind that the token alone will not grant access, as the session also needs to be created by principal mentioned in the token)

Example:

apiVersion: sessiongate.aro-hcp.azure.com/v1alpha1
kind: Session
metadata:
  name: gerd
  namespace: sessiongate
spec:
  ttl: 24h
  managementCluster:
    resourceId: /subscriptions/1d3378d3-5a3f-4712-85a1-2485495dfc4b/resourceGroups/hcp-underlay-pers-usw3gobe-mgmt-1/providers/Microsoft.ContainerService/managedClusters/pers-usw3gobe-mgmt-1
  hostedControlPlane:
    resourceId: /subscriptions/1d3378d3-5a3f-4712-85a1-2485495dfc4b/resourceGroups/goberlec-net-rg-03/providers/Microsoft.RedHatOpenshift/hcpOpenShiftClusters/goberlec
    namespace: ocm-arohcpdev-2nbenhk7fsp3jiap09fe4oo0ag2946vs-goberlec
  accessLevel:
    group: aro-sre-cluster-admin
  owner:
    type: User
    userPrincipal:
      name: [email protected]
      claim: upn
status:
  backendKASURL: https://api.goberlec.xzzl.usw3gobe.hcp.osadev.cloud
  conditions:
  - lastTransitionTime: "2025-12-23T11:11:59Z"
    message: Authorization policy exists
    observedGeneration: 1
    reason: AsExpected
    status: "True"
    type: AuthorizationPolicyAvailable
  - lastTransitionTime: "2025-12-23T11:12:13Z"
    message: Credentials Secret exists
    observedGeneration: 1
    reason: AsExpected
    status: "True"
    type: CredentialsAvailable
  - lastTransitionTime: "2025-12-23T11:12:14Z"
    message: Network path exists
    observedGeneration: 1
    reason: AsExpected
    status: "True"
    type: NetworkPathAvailable
  - lastTransitionTime: "2025-12-23T11:12:14Z"
    message: Session is available
    observedGeneration: 1
    reason: Available
    status: "False"
    type: Progressing
  - lastTransitionTime: "2025-12-23T11:12:14Z"
    message: Session is ready
    observedGeneration: 1
    reason: Ready
    status: "True"
    type: Ready
  - lastTransitionTime: "2025-12-23T11:12:14Z"
    message: Session is active
    observedGeneration: 1
    reason: AsExpected
    status: "True"
    type: SessionActive
  credentialsSecretRef: session-gerd-credentials
  endpoint: https://admin.usw3gobe.hcpsvc.osadev.cloud/sessiongate/gerd/kas
  expiresAt: "2025-12-24T11:11:58Z"

Resulting Kubeconfig

apiVersion: v1
clusters:
- cluster:
    server: https://admin.usw3gobe.hcpsvc.osadev.cloud/sessiongate/gerd/kas
  name: hcp
contexts:
- context:
    cluster: hcp
    user: sre-on-hcp
  name: hcp
current-context: hcp
kind: Config
preferences: {}
users:
- name: sre-on-hcp
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      args:
      - get-token
      - --login
      - azurecli
      - --server-id
      - 6dae42f8-4368-4678-94ff-3960e28e3630
      command: kubelogin
      interactiveMode: IfAvailable
      provideClusterInfo: false

What

Why

Special notes for your reviewer

@openshift-ci
Copy link

openshift-ci bot commented Dec 22, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: geoberle

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@geoberle geoberle marked this pull request as draft December 22, 2025 17:56
@openshift-ci openshift-ci bot requested review from deads2k and janboll December 22, 2025 17:56
@geoberle geoberle force-pushed the sessiongate-controller branch 9 times, most recently from 49402f2 to 619a687 Compare December 23, 2025 13:25
sessiongate is a controller that manages secure breakglass sessions

* credential minting
* authn/z for session access
  * mise integration for bearer token validation
  * istio authorizationpolicy integration for session access gating
* HCP KAS proxy management

Signed-off-by: Gerd Oberlechner <[email protected]>
@geoberle geoberle force-pushed the sessiongate-controller branch from 619a687 to c4367bf Compare December 23, 2025 13:35
// This package imports things required by build scripts, to force `go mod` to see them as dependencies
package tools

import _ "k8s.io/code-generator"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should do this with bingo


// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="spec is immutable"
type SessionSpec struct {
// ttl is the time-to-live duration for the session
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calculated from what point? creation timestamp of the object? establishment of the session?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

creation of the session CR. it needs to be from a geneva action runtime perspective to be comprehendable by customers. i will document this better

TTL metav1.Duration `json:"ttl"`

// managementCluster specifies the AKS management cluster
// +kubebuilder:validation:Required
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these tags will apply if you have them one block higher - so:

	// +kubebuilder:validation:Required

	// managementCluster specifies the AKS management cluster
	ManagementCluster ManagementCluster `json:"managementCluster"`

this also keeps them from showing up in your GoDoc. I prefer that style

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// ManagementCluster identifies an Azure management cluster
type ManagementCluster struct {
// resourceId is the Azure resource ID of the management cluster
// +kubebuilder:validation:Required
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look into using CEL for immutable constraints - do you really want to allow mutations to this field after creation?

See example: https://github.com/openshift/hypershift/blob/351f6ae4538229f30733e474f1596fdc9813b073/api/certificates/v1alpha1/certificaterevocationrequest_types.go#L28

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i misunderstood the CEL statement. i thought having it on spec is sufficiently guarding against modification. i will adapt

// +kubebuilder:validation:Pattern=`^/subscriptions/[a-fA-F0-9-]+/resourceGroups/[^/]+/providers/[^/]+/[^/]+/[^/]+$`
ResourceID string `json:"resourceId"`

// namespace of the HostedControlPlane CR
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this something we can figure out by knowing the HCP resource ID? all we need is the CS cluster ID, which is not something the user needs to know anyway

}

type AccessLevel struct {
// group is the name of the access group
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is insufficient godoc for me to understand what this field does - what are valid values - how do they map into the cluster - etc

}

type SessionStatus struct {
// INSERT ADDITIONAL STATUS FIELD - define observed state of cluster
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover from sample-controller :D

// +optional
CredentialsSecretRef string `json:"credentialsSecretRef,omitempty"`

// backendKASURL is the Kubernetes API server URL for the backend cluster
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this different from the endpoint?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

endpoint is the ingress from an SRE
backend is where the proxy sends data to, which is

  • the public KAS url if the cluster is public
  • something else (cluster internal port-forward endpoint) for private clusters

BackendKASURL string `json:"backendKASURL,omitempty"`
}

func (s *Session) IsReady() bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: put these into some _conditions.go file for ease of reading types


func (r conditionsImpl) findUnhappyDependent() *metav1.Condition {
// This only works if there are dependents.
if len(r.dependants) == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: dependant


// conditionsImpl implements the helper methods for evaluating Conditions.
// +k8s:deepcopy-gen=false
type conditionsImpl struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't seem related to the v1alpha1 api (to be useful it must apply to many apis generally)- consider moving to some other package in your project

"github.com/Azure/ARO-HCP/sessiongate/pkg/controller"
)

// buildAuthorizationPolicyApplyConfig creates an ApplyConfiguration for Server-Side Apply.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: every ApplyConfiguration is for SSA - describe why & how, not what

},
}

// Build ApplyConfiguration using fluent builder pattern
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low-value comment that should be removed

}

// Build ApplyConfiguration using fluent builder pattern
//nolint:govet // copylocks: protobuf message contains a mutex internally; builder copies it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what protobuf message? even if true, duplicate it, don't copy locks

return false, fmt.Errorf("failed to apply AuthorizationPolicy: %w", err)
}

// determine if a change was made
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have the existing policy, instead of always sending the mutating call, why not diff the intended config and exit early if there's nothing to do?

}

// if the policy exists, check if it is owned by the session
existingPolicy, err := c.authzPoliciesLister.AuthorizationPolicies(c.sessionNamespace).Get(policyName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is racy - nothing stops someone from changing ownership between this read and the mutating apply - why is ownership necessary? when will it not be the case? can you factor the logic such that you only work on correctly-owned policies in the frst place?


package controlplane

// Condition types for Session resources
Copy link
Contributor

@stevekuznetsov stevekuznetsov Dec 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of your API, put it in that module

// control plane controller implementation for Session resources.
// it runs with leader election and handles Session reconciliation into
// - istio AuthorizationPolicies
// - session credentials secrets
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look at the validating admission policies used in the secret sync controller or the acr pull binding controller for how to secure this component from having root on the cluster


utilruntime.Must(sessiongateschema.AddToScheme(scheme.Scheme))

ratelimiter := workqueue.NewTypedMaxOfRateLimiter(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure where this came from necessarily, consider the defaults (seen in example controller, etc)

		queue: workqueue.NewTypedRateLimitingQueueWithConfig(
			workqueue.DefaultTypedControllerRateLimiter[gitBranchKey](),
			workqueue.TypedRateLimitingQueueConfig[gitBranchKey]{
				Name: GitControllerName,
			},
		),

(or justify variation from the default)

return nil, fmt.Errorf("failed to add event handler for sessions (control plane): %w", err)
}

// Secret Informer for control plane
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when you add the VAP, you won't be able to do this anymore (that's ok)

UpdateFunc: func(old, new interface{}) {
newPolicy := new.(*securityv1beta1.AuthorizationPolicy)
oldPolicy := old.(*securityv1beta1.AuthorizationPolicy)
if newPolicy.ResourceVersion == oldPolicy.ResourceVersion {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove


newSecret := new.(*corev1.Secret)
oldSecret := old.(*corev1.Secret)
if newSecret.ResourceVersion == oldSecret.ResourceVersion {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

return fmt.Errorf("failed to wait for caches to sync")
}

sessions, err := c.sessionsLister.Sessions(c.sessionNamespace).List(labels.Everything())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? the informer should handle it

return true
}

func (c *Controller) workCeremony(ctx context.Context, logger klog.Logger, objRef cache.ObjectName) (time.Duration, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"ceremony" is not a common term here

}

sessionCopy := session.DeepCopy()
sessionCopy.InitializeConditions()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this smells - why do you need to do this?


requeueAfter, err := c.syncHandler(ctx, logger, sessionCopy)

if patchErr := c.patchSessionStatus(ctx, logger, sessionCopy); patchErr != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why patch instead of apply?

why unconditionally?

} else {
expiresAt = session.Status.ExpiresAt
}
session.Status.ExpiresAt = expiresAt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

every pass of your reconciliation function should take one mutating action, period - consider that your controller may be preempted or crash at any point. any in-memory state you store here that cannot be trivially reconstructed will be lost. write this update and return

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(strongly suggest the data-driven pattern I showed you in hypershift pki operator for modeling the one action your controller takes at the end of reconcile as a struct, then testing with cmp.diff. SSA applyconfigs are a godsend here)

timeUntilExpiration := time.Until(expiresAt.Time)
if timeUntilExpiration <= 0 {
logger.Info("Session has expired, deleting", "session", session.Name, "expiresAt", expiresAt.Time)
session.MarkSessionInactive(sessiongatev1alpha1.ReasonExpired, "Session has expired")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the point of marking this if you do not update the object in the apiserver before deleting?

}

// validation
if err := validateSession(session); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what can be validated in admission?

// validation
if err := validateSession(session); err != nil {
session.StopProgressing(sessiongatev1alpha1.ReasonInvalidConfiguration, err.Error())
logger.Error(err, "Session has invalid configuration")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you do not want to spam the operator's logs with user errors

// Phase 2: authorization policy
//

changed, err := c.ensureAuthorizationPolicy(ctx, session)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. write down which authorization policy you will ensure exists,commit that to the session object
  2. on second reconcile, create the policy

remember you can crash after 1


switch result.Status {
case controller.CredentialStatusHostedControlPlaneNotFound:
logger.V(2).Info("HostedControlPlane not found, will retry when it becomes available")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these things go into status - do not put into logs, just noise for the operator

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(the person who needs to know that this happened is the end user, who looks at the object, not the person in Kusto)

func validateSession(session *sessiongatev1alpha1.Session) error {
// validate TTL is positive
if session.Spec.TTL.Duration <= 0 {
return fmt.Errorf("spec.ttl must be a positive duration")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of these which can be, should be CEL, so they run during admission and no session object is ever created with invalid fields - clients much prefer to see error on create rather than create, wait, error in status

hcpResourceID.ResourceType.Namespace, hcpResourceID.ResourceType.Type)
}

return nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quick read seems to show all of these can be CEL

if session.Status.ExpiresAt != nil {
statusApply = statusApply.WithExpiresAt(*session.Status.ExpiresAt)
}
if session.Status.Endpoint != "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just construct the literal applyconifg you want in the reconcile func, it is dangerous and unlikely to be correct to try to generically construct this later. notably - you cannot distinguish between "I want to set the endpoint to the empty string" from "I do not want to change the endpoint" here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants