
feat(policy): cache entitlement policy in auth service v2 #2326


Closed
wants to merge 227 commits into from

Conversation

jakedoublev (Contributor)

Proposed Changes

Checklist

  • I have added or updated unit tests
  • I have added or updated integration tests (if appropriate)
  • I have added or updated documentation

Testing Instructions

@github-actions bot added the external-contributor (External Org Member), comp:policy (Policy Configuration: attributes, subject mappings, resource mappings, kas registry), comp:authorization, and size/m labels, and removed the comp:policy and comp:authorization labels on May 28, 2025

@gemini-code-assist bot left a comment


Hello @jakedoublev, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini here, providing a summary of this pull request to help everyone get up to speed quickly.

This PR introduces a significant performance improvement for the authorization service (v2) by implementing an in-memory cache for entitlement policy data. Specifically, it caches attribute definitions and subject mappings, which are frequently needed for policy decisions. Instead of fetching this data from the policy service on every request, the authorization service will now load it into memory upon startup and periodically refresh it based on a configurable interval. This change aims to reduce latency and load on the policy service, leading to faster authorization decisions.

Highlights

  • Entitlement Policy Caching: An in-memory cache for policy attributes and subject mappings is introduced in the authorization service (v2).
  • Periodic Cache Refresh: The cache is configured to periodically refresh its data from the policy service to stay up-to-date. The refresh interval is configurable.
  • Just-In-Time PDP Integration: The Just-In-Time Policy Decision Point (PDP) within the authorization service is updated to utilize the new cache for fetching policy data instead of making direct calls to the policy service for each request.
  • Service Registration Hook: A new mechanism is added to the service registry and server startup process to allow services to register hooks that run after all services have been registered. This is used here to initialize and start the policy cache and the cached PDP instance.
  • Configuration Option: A new configuration option (cache_refresh_interval_seconds) is added to control the cache refresh frequency.

Changelog

  • opentdf-dev.yaml
    • Enabled the policy configuration section (lines 45-48).
    • Added cache_refresh_interval_seconds configuration option with a default value of 60 seconds (line 49).
  • opentdf-example.yaml
    • Added a commented-out example for cache_refresh_interval_seconds (line 40).
  • service/authorization/v2/authorization.go
    • Imported the new policy cache package (line 16).
    • Added a pdp field of type *access.JustInTimePDP to the Service struct (line 27).
    • Added OnCompleteServiceRegistration function to create and start the EntitlementPolicyCache and initialize the JustInTimePDP using the cache (lines 34-54).
    • Registered the OnCompleteServiceRegistration hook in the NewRegistration function (line 65).
    • Modified GetEntitlements to use the cached PDP instance (as.pdp) instead of creating a new one on each call (line 109).
    • Modified GetDecision to use the cached PDP instance (as.pdp) instead of creating a new one on each call (line 139).
    • Modified GetDecisionMultiResource to use the cached PDP instance (as.pdp) instead of creating a new one on each call (line 169).
    • Corrected the error return type in GetDecisionMultiResource from connect.NewError to the raw error (line 179).
  • service/internal/access/v2/just_in_time_pdp.go
    • Imported the new policy cache package (line 19).
    • Added a policyCache field of type *policyCache.EntitlementPolicyCache to the JustInTimePDP struct (line 32).
    • Added NewJustInTimePDPWithCachedEntitlementPolicy function to create a PDP that uses the provided cache for fetching policy data (lines 77-128).
    • Modified the logic within NewJustInTimePDPWithCachedEntitlementPolicy to fetch attributes and subject mappings from the cache if it's provided, otherwise fall back to fetching directly (lines 106-119).
  • service/pkg/config/config.go
    • Added ServiceRegistrationCompleteHook type (line 18).
    • Added onServiceRegistrationCompleteHooks slice to the Config struct (line 54).
    • Added AddOnServiceRegistrationCompleteHook method to append hooks (lines 119-121).
    • Added RunRegistrationCompleteHooks method to execute the registered hooks (lines 137-146).
  • service/pkg/server/services.go
    • Added logic to register the OnCompleteServiceRegistrationHook from the service options during service startup (lines 195-197).
  • service/pkg/server/start.go
    • Added a call to cfg.RunRegistrationCompleteHooks after all services have been registered and before the server starts (lines 317-319).
  • service/pkg/serviceregistry/serviceregistry.go
    • Added OnCompleteServiceRegistrationHook type (line 60).
    • Added RegisterOnCompleteServiceRegistrationHook method to the IService interface (line 83).
    • Added OnCompleteServiceRegistration field to the ServiceOptions struct (line 114).
    • Implemented RegisterOnCompleteServiceRegistrationHook method for the Service struct (lines 198-211).
  • service/policy/cache/cache.go
    • New file: Implements the EntitlementPolicyCache struct and its methods for caching policy attributes and subject mappings.
    • Includes methods for creating (NewEntitlementPolicyCache), starting (Start), stopping (Stop), refreshing (Refresh), and retrieving data (GetAttributes, GetSubjectMappings).
    • Implements periodic refresh using a ticker and goroutine (periodicRefresh).
    • Includes helper methods (fetchAllDefinitions, fetchAllSubjectMappings) to fetch data from the SDK with pagination handling.
  • service/policy/config/config.go
    • Added configuredRefreshInterval variable with a default value (line 14).
    • Added CacheRefreshIntervalSeconds field to the Config struct (line 24).
    • Modified GetSharedPolicyConfig to set the global configuredRefreshInterval based on the config value (line 52).
    • Added GetPolicyEntitlementCacheRefreshInterval function to expose the configured interval (lines 59-61).
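The registration-hook mechanism threaded through `config.go`, `services.go`, and `start.go` above can be sketched as follows. This is a simplified illustration of the `AddOnServiceRegistrationCompleteHook` / `RunRegistrationCompleteHooks` flow named in the changelog; the zero-argument hook signature is an assumption (the real hooks likely receive a context and logger).

```go
package main

import "fmt"

// ServiceRegistrationCompleteHook runs after all services are registered
// and before the server starts. Signature is illustrative.
type ServiceRegistrationCompleteHook func() error

type Config struct {
	onServiceRegistrationCompleteHooks []ServiceRegistrationCompleteHook
}

// AddOnServiceRegistrationCompleteHook queues a hook for startup.
func (c *Config) AddOnServiceRegistrationCompleteHook(h ServiceRegistrationCompleteHook) {
	c.onServiceRegistrationCompleteHooks = append(c.onServiceRegistrationCompleteHooks, h)
}

// RunRegistrationCompleteHooks executes hooks in registration order,
// stopping at the first error.
func (c *Config) RunRegistrationCompleteHooks() error {
	for _, h := range c.onServiceRegistrationCompleteHooks {
		if err := h(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	cfg := &Config{}
	// A service (e.g. authorization v2) registers a hook that starts its
	// policy cache once every other service is available.
	cfg.AddOnServiceRegistrationCompleteHook(func() error {
		fmt.Println("starting policy cache")
		return nil
	})
	// The server runs this after registration, before serving traffic.
	if err := cfg.RunRegistrationCompleteHooks(); err != nil {
		panic(err)
	}
}
```

The design lets a service defer expensive initialization (here, the initial cache load) until its dependencies are guaranteed to be registered.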
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Policy data flows,
Cached in memory it goes,
Decisions are fast.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist bot left a comment


Code Review

This pull request introduces a caching mechanism for entitlement policy (attributes and subject mappings) within the authorization service. This is a significant improvement for performance, as it avoids fetching policy data from the policy service on every decision/entitlement request. The implementation involves creating a new policy cache package, integrating it into the authorization service, and updating the service registration flow to initialize the cache and the PDP using the cached data. The changes also include adding configuration options for the cache refresh interval and updating the example YAML files.

The overall approach seems sound and addresses a potential performance bottleneck. However, there are a few areas that require attention, particularly regarding error handling, service readiness, and potential issues with the cache implementation.

Summary of Findings

  • Cache Data Mutability: The GetAttributes and GetSubjectMappings methods in the policy cache return internal slices directly. This allows external callers to modify the cached data, which can lead to race conditions and incorrect behavior. It is recommended to return copies of the slices to prevent external modification.
  • Error Handling Consistency: The error handling in GetDecisionMultiResource was changed to return the raw error from a helper function instead of wrapping it in a connect.NewError with CodeInternal. This alters the gRPC status code returned to the client for this specific error path, potentially breaking clients that expect CodeInternal.
  • Service Readiness Check: The readiness check registration for the authorization service is commented out. A proper readiness check is essential to ensure the service is fully initialized (including the policy cache and PDP) before it is considered ready to handle requests. This should be restored and implemented to reflect the actual readiness state.
  • Efficiency of Fetching All Policy: The cache implementation fetches all active attributes and all subject mappings from the policy service. While the pagination logic correctly handles fetching large datasets, the total volume of policy data could become very large in some deployments, potentially impacting memory usage and refresh times. This is a scalability consideration inherent to caching all policy data.
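The mutability finding above can be demonstrated with a short, self-contained sketch. `cacheDemo`, `unsafeGet`, and `safeGet` are illustrative names, not the PR's actual types: handing out the internal slice aliases the cache's state, while a defensive copy isolates it.

```go
package main

import "fmt"

// cacheDemo stands in for a cache holding a shared internal slice.
type cacheDemo struct {
	values []string
}

// unsafeGet hands out the internal slice directly (the flagged pattern).
func (c *cacheDemo) unsafeGet() []string { return c.values }

// safeGet returns a defensive copy (the recommended fix).
func (c *cacheDemo) safeGet() []string {
	out := make([]string, len(c.values))
	copy(out, c.values)
	return out
}

func main() {
	c := &cacheDemo{values: []string{"attr-a", "attr-b"}}

	got := c.unsafeGet()
	got[0] = "mutated"        // writes through to the shared cache state
	fmt.Println(c.values[0])  // → mutated

	c.values[0] = "attr-a"    // reset
	got = c.safeGet()
	got[0] = "mutated"        // only the caller's copy changes
	fmt.Println(c.values[0])  // → attr-a
}
```

Combined with concurrent writes from a periodic refresher, the unsafe variant is also a data race, not just a correctness hazard.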

Merge Readiness

This pull request introduces a valuable caching feature for the authorization service, which should improve performance. However, the identified high severity issues related to cache data mutability, error handling consistency, and the missing readiness check should be addressed before merging. These issues could impact the correctness and reliability of the service. I am unable to approve this pull request directly; please have other reviewers assess the changes after the suggested modifications are made.

```diff
@@ -166,7 +176,7 @@ func (as *Service) GetDecisionMultiResource(ctx context.Context, req *connect.Re
 	resourceDecisions, err := rollupMultiResourceDecision(decisions)
 	if err != nil {
 		as.logger.ErrorContext(ctx, "failed to rollup multi resource decision", slog.String("error", err.Error()))
-		return nil, connect.NewError(connect.CodeInternal, err)
+		return nil, err
```

Severity: high

This change modifies the error returned by the GetDecisionMultiResource handler. Previously, it wrapped the error from rollupMultiResourceDecision in a connect.NewError with connect.CodeInternal. Now, it returns the raw error. If rollupMultiResourceDecision returns a standard Go error, the Connect handler will wrap it in connect.NewError(connect.CodeUnknown, err). This changes the gRPC status code returned to the client from Internal to Unknown for this specific error path. Is this change in API behavior intended? It might be better to maintain consistency and return CodeInternal.

Suggested change:

```diff
-return nil, err
+return nil, connect.NewError(connect.CodeInternal, err)
```

Comment on lines +131 to +132:

```go
// TODO: we may want to copy this so callers cannot modify the cached data
return c.attributes
```

Severity: high

The TODO comment here points out a potential issue. Returning the internal slice directly allows callers to modify the cached data, which could lead to race conditions or unexpected behavior if multiple goroutines are reading and writing to the cache concurrently or if a caller unintentionally modifies the shared state. It's generally safer to return a copy of the slice.

Suggested change:

```go
// Return a copy so callers cannot modify the cached data.
attributesCopy := make([]*policy.Attribute, len(c.attributes))
copy(attributesCopy, c.attributes)
return attributesCopy
```

Comment on lines +140 to +141:

```go
// TODO: we may want to copy this so callers cannot modify the cached data
return c.subjectMappings
```

Severity: high

Similar to the GetAttributes method, returning the internal slice of subject mappings directly allows callers to modify the cached data. This is a potential source of bugs and race conditions. Returning a copy is safer.

Suggested change:

```go
// Return a copy so callers cannot modify the cached data.
subjectMappingsCopy := make([]*policy.SubjectMapping, len(c.subjectMappings))
copy(subjectMappingsCopy, c.subjectMappings)
return subjectMappingsCopy
```


Benchmark results

Benchmark authorization.GetDecisions Results:

| Metric | Value |
| --- | --- |
| Approved Decision Requests | 5000 |
| Denied Decision Requests | 0 |
| Total Time | 454.519222ms |

Benchmark authorization.v2.GetMultiResourceDecision Results:

| Metric | Value |
| --- | --- |
| Approved Decision Requests | 5000 |
| Denied Decision Requests | 0 |
| Total Time | 272.522271ms |

Bulk Benchmark Results:

| Metric | Value |
| --- | --- |
| Total Decrypts | 100 |
| Successful Decrypts | 100 |
| Failed Decrypts | 0 |
| Total Time | 343.064291ms |
| Throughput | 291.49 requests/second |

TDF3 Benchmark Results:

| Metric | Value |
| --- | --- |
| Total Requests | 5000 |
| Successful Requests | 5000 |
| Failed Requests | 0 |
| Concurrent Requests | 50 |
| Total Time | 48.29579536s |
| Average Latency | 479.994809ms |
| Throughput | 103.53 requests/second |

NANOTDF Benchmark Results:

| Metric | Value |
| --- | --- |
| Total Requests | 5000 |
| Successful Requests | 4947 |
| Failed Requests | 53 |
| Concurrent Requests | 50 |
| Total Time | 39.110817494s |
| Average Latency | 386.890343ms |
| Throughput | 126.49 requests/second |

Error Summary:

| Error Message | Occurrences |
| --- | --- |
| ReadNanoTDF error: getNanoRewrapKey: rewrapError: internal: internal error; rpc error: code = Internal desc = could not perform access | 53 |

Standard Benchmark Metrics Skipped or Failed

@jakedoublev (Contributor, Author)

Closing for #2327
