feat(policy): cache entitlement policy in auth service v2 #2326
Conversation
Hello @jakedoublev, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request to help everyone get up to speed quickly.
This PR introduces a significant performance improvement for the authorization service (v2) by implementing an in-memory cache for entitlement policy data. Specifically, it caches attribute definitions and subject mappings, which are frequently needed for policy decisions. Instead of fetching this data from the policy service on every request, the authorization service will now load it into memory upon startup and periodically refresh it based on a configurable interval. This change aims to reduce latency and load on the policy service, leading to faster authorization decisions.
Highlights
- Entitlement Policy Caching: An in-memory cache for policy attributes and subject mappings is introduced in the authorization service (v2).
- Periodic Cache Refresh: The cache is configured to periodically refresh its data from the policy service to stay up-to-date. The refresh interval is configurable.
- Just-In-Time PDP Integration: The Just-In-Time Policy Decision Point (PDP) within the authorization service is updated to utilize the new cache for fetching policy data instead of making direct calls to the policy service for each request.
- Service Registration Hook: A new mechanism is added to the service registry and server startup process to allow services to register hooks that run after all services have been registered. This is used here to initialize and start the policy cache and the cached PDP instance.
- Configuration Option: A new configuration option (`cache_refresh_interval_seconds`) is added to control the cache refresh frequency.
Changelog
- opentdf-dev.yaml
  - Enabled the policy configuration section (lines 45-48).
  - Added `cache_refresh_interval_seconds` configuration option with a default value of 60 seconds (line 49).
- opentdf-example.yaml
  - Added a commented-out example for `cache_refresh_interval_seconds` (line 40).
- service/authorization/v2/authorization.go
  - Imported the new policy cache package (line 16).
  - Added a `pdp` field of type `*access.JustInTimePDP` to the `Service` struct (line 27).
  - Added `OnCompleteServiceRegistration` function to create and start the `EntitlementPolicyCache` and initialize the `JustInTimePDP` using the cache (lines 34-54).
  - Registered the `OnCompleteServiceRegistration` hook in the `NewRegistration` function (line 65).
  - Modified `GetEntitlements` to use the cached PDP instance (`as.pdp`) instead of creating a new one on each call (line 109).
  - Modified `GetDecision` to use the cached PDP instance (`as.pdp`) instead of creating a new one on each call (line 139).
  - Modified `GetDecisionMultiResource` to use the cached PDP instance (`as.pdp`) instead of creating a new one on each call (line 169).
  - Corrected the error return type in `GetDecisionMultiResource` from `connect.NewError` to the raw error (line 179).
- service/internal/access/v2/just_in_time_pdp.go
  - Imported the new policy cache package (line 19).
  - Added a `policyCache` field of type `*policyCache.EntitlementPolicyCache` to the `JustInTimePDP` struct (line 32).
  - Added `NewJustInTimePDPWithCachedEntitlementPolicy` function to create a PDP that uses the provided cache for fetching policy data (lines 77-128).
  - Modified the logic within `NewJustInTimePDPWithCachedEntitlementPolicy` to fetch attributes and subject mappings from the cache if it's provided, otherwise fall back to fetching directly (lines 106-119).
- service/pkg/config/config.go
  - Added `ServiceRegistrationCompleteHook` type (line 18).
  - Added `onServiceRegistrationCompleteHooks` slice to the `Config` struct (line 54).
  - Added `AddOnServiceRegistrationCompleteHook` method to append hooks (lines 119-121).
  - Added `RunRegistrationCompleteHooks` method to execute the registered hooks (lines 137-146).
- service/pkg/server/services.go
  - Added logic to register the `OnCompleteServiceRegistrationHook` from the service options during service startup (lines 195-197).
- service/pkg/server/start.go
  - Added a call to `cfg.RunRegistrationCompleteHooks` after all services have been registered and before the server starts (lines 317-319).
- service/pkg/serviceregistry/serviceregistry.go
  - Added `OnCompleteServiceRegistrationHook` type (line 60).
  - Added `RegisterOnCompleteServiceRegistrationHook` method to the `IService` interface (line 83).
  - Added `OnCompleteServiceRegistration` field to the `ServiceOptions` struct (line 114).
  - Implemented `RegisterOnCompleteServiceRegistrationHook` method for the `Service` struct (lines 198-211).
- service/policy/cache/cache.go
  - New file: Implements the `EntitlementPolicyCache` struct and its methods for caching policy attributes and subject mappings.
  - Includes methods for creating (`NewEntitlementPolicyCache`), starting (`Start`), stopping (`Stop`), refreshing (`Refresh`), and retrieving data (`GetAttributes`, `GetSubjectMappings`).
  - Implements periodic refresh using a ticker and goroutine (`periodicRefresh`).
  - Includes helper methods (`fetchAllDefinitions`, `fetchAllSubjectMappings`) to fetch data from the SDK with pagination handling.
- service/policy/config/config.go
  - Added `configuredRefreshInterval` variable with a default value (line 14).
  - Added `CacheRefreshIntervalSeconds` field to the `Config` struct (line 24).
  - Modified `GetSharedPolicyConfig` to set the global `configuredRefreshInterval` based on the config value (line 52).
  - Added `GetPolicyEntitlementCacheRefreshInterval` function to expose the configured interval (lines 59-61).
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Policy data flows,
Cached in memory it goes,
Decisions are fast.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
Code Review
This pull request introduces a caching mechanism for entitlement policy (attributes and subject mappings) within the authorization service. This is a significant improvement for performance, as it avoids fetching policy data from the policy service on every decision/entitlement request. The implementation involves creating a new policy cache package, integrating it into the authorization service, and updating the service registration flow to initialize the cache and the PDP using the cached data. The changes also include adding configuration options for the cache refresh interval and updating the example YAML files.
The overall approach seems sound and addresses a potential performance bottleneck. However, there are a few areas that require attention, particularly regarding error handling, service readiness, and potential issues with the cache implementation.
Summary of Findings
- Cache Data Mutability: The `GetAttributes` and `GetSubjectMappings` methods in the policy cache return internal slices directly. This allows external callers to modify the cached data, which can lead to race conditions and incorrect behavior. It is recommended to return copies of the slices to prevent external modification.
- Error Handling Consistency: The error handling in `GetDecisionMultiResource` was changed to return the raw error from a helper function instead of wrapping it in a `connect.NewError` with `CodeInternal`. This alters the gRPC status code returned to the client for this specific error path, potentially breaking clients that expect `CodeInternal`.
- Service Readiness Check: The readiness check registration for the authorization service is commented out. A proper readiness check is essential to ensure the service is fully initialized (including the policy cache and PDP) before it is considered ready to handle requests. This should be restored and implemented to reflect the actual readiness state.
- Efficiency of Fetching All Policy: The cache implementation fetches all active attributes and all subject mappings from the policy service. While the pagination logic correctly handles fetching large datasets, the total volume of policy data could become very large in some deployments, potentially impacting memory usage and refresh times. This is a scalability consideration inherent to caching all policy data.
Merge Readiness
This pull request introduces a valuable caching feature for the authorization service, which should improve performance. However, the identified high-severity issues related to cache data mutability, error handling consistency, and the missing readiness check should be addressed before merging. These issues could impact the correctness and reliability of the service. I am unable to approve this pull request directly; please have other reviewers assess the changes after the suggested modifications are made.
```diff
@@ -166,7 +176,7 @@ func (as *Service) GetDecisionMultiResource(ctx context.Context, req *connect.Re
 	resourceDecisions, err := rollupMultiResourceDecision(decisions)
 	if err != nil {
 		as.logger.ErrorContext(ctx, "failed to rollup multi resource decision", slog.String("error", err.Error()))
-		return nil, connect.NewError(connect.CodeInternal, err)
+		return nil, err
```
This change modifies the error returned by the `GetDecisionMultiResource` handler. Previously, it wrapped the error from `rollupMultiResourceDecision` in a `connect.NewError` with `connect.CodeInternal`. Now, it returns the raw error. If `rollupMultiResourceDecision` returns a standard Go error, the Connect handler will wrap it in `connect.NewError(connect.CodeUnknown, err)`. This changes the gRPC status code returned to the client from `Internal` to `Unknown` for this specific error path. Is this change in API behavior intended? It might be better to maintain consistency and return `CodeInternal`.
```diff
-return nil, err
+return nil, connect.NewError(connect.CodeInternal, err)
```
```go
// TODO: we may want to copy this so callers cannot modify the cached data
return c.attributes
```
The TODO comment here points out a potential issue. Returning the internal slice directly allows callers to modify the cached data, which could lead to race conditions or unexpected behavior if multiple goroutines are reading and writing to the cache concurrently or if a caller unintentionally modifies the shared state. It's generally safer to return a copy of the slice.
```go
// TODO: we may want to copy this so callers cannot modify the cached data
attributesCopy := make([]*policy.Attribute, len(c.attributes))
copy(attributesCopy, c.attributes)
return attributesCopy
```
```go
// TODO: we may want to copy this so callers cannot modify the cached data
return c.subjectMappings
```
Similar to the `GetAttributes` method, returning the internal slice of subject mappings directly allows callers to modify the cached data. This is a potential source of bugs and race conditions. Returning a copy is safer.
```go
// TODO: we may want to copy this so callers cannot modify the cached data
subjectMappingsCopy := make([]*policy.SubjectMapping, len(c.subjectMappings))
copy(subjectMappingsCopy, c.subjectMappings)
return subjectMappingsCopy
```
Benchmark results: authorization.GetDecisions, authorization.v2.GetMultiResourceDecision, bulk TDF3, and NanoTDF benchmark results were attached; the error summary noted that standard benchmark metrics were skipped or failed.
Closing for #2327