Draft: AI Policies #120
Conversation
### `LLMPromptRiskCheckPolicy`

A Kuadrant `LLMPromptRiskCheckPolicy` is a custom resource provided by Kuadrant that targets Gateway API resources (`Gateway` and `HTTPRoute`), enabling users to define and enforce content safety rules on LLM prompts, detecting and blocking sensitive prompts. Prompt guards can be defined and enforced for both Gateways and individual HTTPRoutes.
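For illustration only (not part of the proposal diff): a minimal sketch of how such a policy might be expressed, reusing the `kuadrant.io/v1alpha1` group and kind shown later in this proposal; the `targetRef` shape follows Gateway API policy attachment, and the `categories` field is an assumption borrowed from the discussion below.

```yaml
apiVersion: kuadrant.io/v1alpha1
kind: LLMPromptRiskCheckPolicy
metadata:
  name: prompt-guard
spec:
  targetRef:                           # Gateway API policy attachment
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: llm-route
  categories:                          # hypothetical: content-safety categories to detect and block
    - violence
    - self-harm
```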
Would be interesting to get some feedback from an SME about what and where they want things checked. In my naive view I can see the gateway being a good place to check the prompt. But I wonder about the response. Why would you not want that done before sending the response over the network?
It may not be possible to avoid network traffic, like in the case of streaming the response.
You may end up streaming/chunking the response and checking it all, or in parts.
### `LLMResponseRiskCheckPolicy`
My instinct here, as hinted at above, is to start with the prompt check and seek more opinion and feedback on the suitability of checking this at the gateway. Maybe it is still desirable to have a "last line of defence" type policy in place?
Depending on how you use it, it could be a last line of defence.
It could also be a targeted response check based on a specific group of users, for example, since you have access to the user's auth context. That capability may not be easily available in the model serving runtime without custom logic.
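A purely illustrative sketch of that targeted-check idea (the `categories` field and overall shape are assumptions; the `when`/`predicate` form mirrors the RLP example further down):

```yaml
apiVersion: kuadrant.io/v1alpha1
kind: LLMResponseRiskCheckPolicy
metadata:
  name: response-guard-under18
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: llm-route
  categories:            # hypothetical field
    - violence
  when:                  # scope the response check to a group of users via the auth context
    - predicate: "auth.identity.age < 18"
```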
```yaml
counter: auth.identity.userid
---
apiVersion: kuadrant.io/v1alpha1
kind: LLMPromptRiskCheckPolicy
```
I wonder could these be in the same API, such as `LLMContentPolicy`, but split underneath by the kuadrant operator, as they are very similar? Or do you expect the model being used to change based on the policy type?

```yaml
spec:
  model: ....
  llmprompt:
    categories:
      ....
  llmresponse:
    categories:
    response:
      ...
```
They certainly could be in the same API.
My sole reason at the time for splitting them was usability.
That is, writing/configuring each part separately in a sizable block.
There may be a more concrete case for keeping them split.
Thoughts on whether you might want to apply a prompt check to a different group of users than a response risk check?
Although not shown in this example, it could be useful to have a predicate for which users the policy applies to (based on auth headers etc.).
If combined into one policy, it would mean having multiple predicate fields.
Do we have a precedent for this with existing policies?
Yes, RLP has that:

```yaml
limits:
  "alice-limit":
    rates:
      - limit: 5
        window: 10s
    when:
      - predicate: "auth.identity.userid == 'alice'"
  "bob-limit":
    rates:
      - limit: 2
        window: 10s
    when:
      - predicate: "auth.identity.userid == 'bob'"
```
So perhaps it's a use case like:

```yaml
spec:
  model: ....
  llmprompt:
    - "under18":
        categories:
          ....
        when:
          - predicate: "auth.identity.age < 18"
    - "over18":
        categories:
          ....
        when:
          - predicate: "auth.identity.age >= 18"
          - predicate: "request.model == 'educational'"
  llmresponse:
    - "under18":
```

This is just pseudo stuff but might be useful to think about?
nit: `request.model` seems to be wanting to "enhance" the `request` struct with arbitrary fields. I'd advise against that. When is that `model` field present? I don't want users to ask themselves these questions. I understand this is just for illustration purposes, but it nonetheless raises an interesting question: where would "additional policies" append data to the "well-known attributes"?
Yes, where we append this data is an important consideration. Perhaps we need a new namespace for AI metadata, e.g. `ai.model`.
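For example (illustrative only), a predicate against such a namespace might look like:

```yaml
when:
  - predicate: "ai.model == 'educational'"   # the 'ai.*' well-known attribute namespace is hypothetical
```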
- Either:
  - Extend our existing `wasm-shim` to optionally amend the existing `actionSet` to call both the guard filter and the token parsing filter implementation.
  - Create a new `ext_proc` gRPC service for parsing OpenAI-style usage metrics and adding these as well-known dynamic metadata, for use by Limitador
- Extend the wasm-shim and `RateLimitPolicy` to give a means to specify an increment (currently, [hard-coded](https://github.com/Kuadrant/wasm-shim/blob/main/src/service/rate_limit.rs#L18) to `1`)
I don't know if this is needed in the RLP API? It seems a very special use case. I wonder instead if it could be made some form of internal config, not exposed to the user at this point?
That is an option, yes: we either increment the counter in a custom fashion somewhere in this filter chain, or we extend RLP somehow to support it. If there's general utility in RLP, I guess that route may be more preferable.
Slight worry about having another mechanism other than Limitador doing it - would we end up needing to re-implement/copy a bunch of existing machinery? Unsure.
Hmm, I might be misunderstanding. I was instead thinking of using the `hits_addend` and setting it dynamically if it is an AI interaction. We need @eguzki or @alexsnaps here - interested to know what their thoughts are on how to send a custom increment.
While this isn't implemented afaik, well-known attributes were meant to support this. So one way would be to have `ratelimit.hits_addend` (while defaulting to `1`) be mutable by "upstream" actions, so that a policy could set it to some arbitrary value before the request to Limitador is made.
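A rough sketch of that idea (none of this exists today; the metadata key and expression below are made-up assumptions):

```yaml
# Hypothetical: an upstream action overrides the well-known attribute before the
# rate-limit action runs, so Limitador decrements the counter by the parsed token
# count instead of the default of 1.
ratelimit.hits_addend: "int(metadata.filter_metadata['token-usage'].total_tokens)"
```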
that'd do nicely
And, yes! 🙌 `service.ext_proc.v3.ProcessingResponse` supports dynamic metadata to have data flow between envoy and these processes. If wasm-shim dispatches the call, then no worries in using that for that. If envoy does, we have to check what happens to it and how/if we can read it back properly from wasm though (should be fine™ - t&c apply).
### Parsing OpenAI-style usage metrics
OpenAI-style usage metrics for both completion and response APIs generally have a `usage` object, with values for `prompt_tokens` (token count for the initial prompt), `completion_tokens` (tokens generated by the model in response) and `total_tokens` (prompt + response token count).
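For reference, the `usage` object in a (non-streamed) OpenAI-style chat completion response looks roughly like this (token values are illustrative):

```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 13,
    "completion_tokens": 42,
    "total_tokens": 55
  }
}
```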
Is there one of these that is more popular than the other? I am wondering whether it makes sense to support just one for now and expand beyond that in the future?
Assume this is re: Chat Completions API vs Responses API.
The Chat Completions API is more universally supported; the Responses API is newer, but is designed to work with more use-cases (agentic use-cases, as well as support for "reasoning/show my thoughts" response streaming).
Given the permutations, this will add some extra complexity to how we parse usage metrics. There is a basic Golang example of an `ext_proc` that can parse these metrics (non-streamed responses) here: https://github.com/jasonmadigan/token-ext-proc
We will also want to support llama-stack style responses. Inference chat-completion with llama-stack offers the option for a configurable (JSON-schema) guided `response_format`. This may hint that we'll want to offer some customisation in terms of where to look for metrics (probably CEL, or JQ-style querying?).
Beyond these tokens, are there other reasons a user may want to pull something from the body? I think I would start with inferring and looking for these specific values rather than surfacing them into the API at this stage?
As in I would prefer the user to have to hint at the response types to expect and we then use that to decide what values to pull out rather than open up the entire request body to the user to pull values out of.
The reason I think we may want to offer some sort of API here is that although lots of runtimes offer openai-style completion APIs, in the ones I've looked at there are some small differences (e.g. where in the JSON response the usage metrics are) which could break the policy. We could have some prebuilt "variants" to make the APIs look nice though (perhaps starting with one for llama-stack and one for openai-style) - these variants would come with builtin selectors on what attributes to pluck to get our usage metrics (if that makes sense).
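A rough sketch of that "variants" idea (field names are purely illustrative, nothing here is an agreed API):

```yaml
tokenUsage:
  variant: openai            # or: llama-stack; a preset for where usage metrics live in the response
  # a variant could internally resolve to selectors such as:
  #   promptTokens:     usage.prompt_tokens
  #   completionTokens: usage.completion_tokens
  #   totalTokens:      usage.total_tokens
```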
Yeah, like we could for now call out those as the supported options, rather than jumping straight to giving the power to the user, which we might regret and not be able to take back. That said, we are talking about alpha APIs, so easier to take back than other places.
that makes sense. I suppose internally we'll probably use selectors, and then if we decide later, we can expose those to end users
This is a really great start to some cool features. I think we still need to nail down how we want to do the request filtering (seems to be leaning towards the WASM shim) and also whether we need to expose certain options to the user or not
One other potential policy which may emerge here, depending on how this PoC progresses, is a
Re: #118
Not ready for review yet.