Draft: AI Policies #120

Wants to merge 27 commits into base: main.
Conversation

@jasonmadigan (Member) commented Apr 9, 2025:

Re: #118

Not ready for review yet.

jasonmadigan and others added 20 commits April 8, 2025 17:02

### `LLMPromptRiskCheckPolicy`

A Kuadrant `LLMPromptRiskCheckPolicy` is a custom resource that targets Gateway API resources (`Gateway` and `HTTPRoute`), enabling users to define and enforce content safety rules for LLM prompts, detecting and blocking sensitive or risky prompts. Prompt guards can be defined and enforced for both Gateways and individual HTTPRoutes.
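
For illustration, a minimal sketch of how such a policy might attach to an `HTTPRoute` (the `spec` fields below are assumptions for discussion, not an agreed API):

```yaml
apiVersion: kuadrant.io/v1alpha1
kind: LLMPromptRiskCheckPolicy
metadata:
  name: prompt-guard
spec:
  # Standard Gateway API policy attachment
  targetRef:
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: llm-api
  # Hypothetical fields: which risk categories to screen prompts for,
  # and what to do when a risky prompt is detected
  categories:
    - violence
    - self-harm
  onDetection: reject
```
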
Collaborator:

Would be interesting to get some feedback from an SME about what they want checked and where. In my naive view I can see the gateway being a good place to check the prompt. But I wonder about the response: why would you not want that done before sending the response over the network?

Member:

It may not be possible to avoid network traffic, like in the case of streaming the response.
You may end up streaming/chunking the response and checking it all, or in parts.



### `LLMResponseRiskCheckPolicy`
Collaborator:

My instinct here, as hinted at above, is to start with the prompt check and seek more opinion and feedback on the suitability of checking this at the gateway. Maybe it is still desirable to have a "last line of defence" type policy in place?

Member:

Depending on how you use it, it could be a last line of defence.

It could also be a targeted response check based on a specific group of users, for example, since you have access to the user's auth context. That capability may not be easily available in the model serving runtime without custom logic.

```yaml
counter: auth.identity.userid
---
apiVersion: kuadrant.io/v1alpha1
kind: LLMPromptRiskCheckPolicy
```
Collaborator:

I wonder, could these be in the same API, such as an `LLMContentPolicy`, but split underneath by the kuadrant operator, as they are very similar? Or do you expect the model being used to change based on the policy type?

```yaml
spec:
  model: ....
  llmprompt:
    categories:
       ....
  llmresponse:
    categories:
       ....
```

Member:

They certainly could be in the same API.
My sole reason at the time for splitting them was usability.
That is, writing/configuring each part separately in a sizable block.

There may be a more concrete case for keeping them split.
Thoughts on whether you'd want to apply a prompt check to a different group of users than a risk check?
Although not shown in this example, it could be useful to have a predicate for which users the policy applies to (based on auth headers etc.).
If combined into 1 policy, it would mean having multiple predicate fields.
Do we have a precedent for this with existing policies?

Collaborator:

Yes, RLP (RateLimitPolicy) has that:

```yaml
limits:
  "alice-limit":
    rates:
    - limit: 5
      window: 10s
    when:
    - predicate: "auth.identity.userid == 'alice'"
  "bob-limit":
    rates:
    - limit: 2
      window: 10s
    when:
    - predicate: "auth.identity.userid == 'bob'"
```

@maleck13 (Collaborator) commented Apr 17, 2025:

So perhaps it's a use case like:

```yaml
spec:
  model: ....
  llmprompt:
    - "under18":
        categories:
          ....
        when:
          - predicate: "auth.identity.age < 18"
    - "over18":
        categories:
          ....
        when:
          - predicate: "auth.identity.age >= 18"
          - predicate: "request.model == 'educational'"
  llmresponse:
    - "under18":
```

This is just pseudo-code but might be useful to think about.

Member:

nit: `request.model` seems to want to "enhance" the request struct with arbitrary fields. I'd advise against that. When is that model field present? I don't want users to ask themselves these questions. I understand this is just for illustration purposes, but it nonetheless raises an interesting question: where would "additional policies" append data to the "well-known attributes"?

Collaborator:

Yes, where we append this data is an important consideration. Perhaps we need a new namespace for AI metadata, e.g. `ai.model`.

- Either:
  - Extend our existing `wasm-shim` to optionally amend the existing `actionSet` to call both the guard filter and the token-parsing filter implementation, or
  - Create a new `ext_proc` gRPC service for parsing OpenAI-style usage metrics and adding these as well-known dynamic metadata, for use by Limitador
- Extend the wasm-shim and `RateLimitPolicy` to provide a means to specify an increment (currently [hard-coded](https://github.com/Kuadrant/wasm-shim/blob/main/src/service/rate_limit.rs#L18) to `1`)
Collaborator:

I don't know if this is needed in the RLP API; it seems a very specific use case. I wonder instead if it could be made some form of internal config, not exposed to the user at this point?

Member Author:

That is an option, yes: we either increment the counter in a custom fashion somewhere in this filter chain, or we extend RLP somehow to support it. If there's general utility in RLP, I guess that route may be preferable.

Slight worry about having another mechanism other than Limitador doing it - we'd end up needing to re-implement/copy a bunch of existing machinery? Unsure.

Collaborator:

Hmm, I might be misunderstanding. I was instead thinking of using the `hits_addend` and setting it dynamically if it is an AI interaction. We need @eguzki or @alexsnaps here - interested to know what their thoughts are on how to send a custom increment.

Member:

While this isn't implemented afaik, well-known attributes were meant to support this. So one way would be to have `ratelimit.hits_addend` (while defaulting to 1) be mutable by "upstream" actions, so that a policy could set it to some arbitrary value before the request to Limitador is made.
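
For illustration only, a sketch of that idea (nothing below is an existing field or attribute path; it is purely an assumption of how the pieces could fit together):

```yaml
# Hypothetical: a limit whose increment is driven by parsed token usage rather
# than the default of 1.
limits:
  "tokens-per-user":
    rates:
    - limit: 50000
      window: 1d
    when:
    - predicate: "auth.identity.userid != ''"
# Assumed mechanism: an upstream action (wasm-shim or ext_proc) parses the usage
# metrics from the response and sets the well-known attribute
# `ratelimit.hits_addend` (e.g. to total_tokens) before Limitador is called.
```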

Member Author:

that'd do nicely

Member:

And, yes! 🙌 `service.ext_proc.v3.ProcessingResponse` supports dynamic metadata to have data flow between Envoy and these processes. If the wasm-shim dispatches the call, then no worries in using that for this. If Envoy does, we have to check what happens to it and how/if we can read it back properly from wasm, though (should be fine™ - t&c apply).


### Parsing OpenAI-style usage metrics

OpenAI-style usage metrics for both the completions and responses APIs generally include a `usage` object, with values for `prompt_tokens` (token count for the initial prompt), `completion_tokens` (tokens generated by the model in response) and `total_tokens` (prompt + response token count).
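
For illustration, the `usage` object in a (non-streamed) chat completion response typically looks like the following (shown as YAML for consistency with the other examples; actual responses are JSON, and the values are made up):

```yaml
usage:
  prompt_tokens: 42       # tokens in the request prompt
  completion_tokens: 128  # tokens generated by the model in the response
  total_tokens: 170       # prompt_tokens + completion_tokens
```
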
Collaborator:

Is there one of these that is more popular than the other? I am wondering whether it makes sense to support just one for now and expand beyond that in the future?

Member Author:

Assume this is re: the Chat Completions API vs the Responses API.

The Chat Completions API is more universally supported; the Responses API is newer, but is designed to work with more use-cases (agentic use-cases, as well as support for "reasoning/show my thoughts" response streaming).


Given the permutations, this will add some extra complexity to how we parse usage metrics. There is a basic Golang example of an `ext_proc` that can parse these metrics (non-streamed responses) here: https://github.com/jasonmadigan/token-ext-proc

We will also want to support llama-stack style responses. Inference chat-completion with llama-stack offers the option of a configurable (JSON-schema) guided `response_format`. This may hint that we'll want to offer some customisation in terms of where to look for metrics (probably CEL, or JQ-style querying?).
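
Purely as a sketch of what that customisation could look like (the field names and selector expressions below are assumptions, not a proposed API):

```yaml
# Hypothetical: per-runtime selectors describing where usage metrics live in the
# response body, so OpenAI-style, llama-stack and other variants could be supported.
tokenUsage:
  promptTokens: "responseBody.usage.prompt_tokens"
  completionTokens: "responseBody.usage.completion_tokens"
  totalTokens: "responseBody.usage.total_tokens"
```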
Collaborator:

Beyond these tokens, are there other reasons a user may want to pull something from the body? I think I would start with inferring and looking for these specific values rather than surfacing them into the API at this stage.

Collaborator:

As in, I would prefer the user to have to hint at the response types to expect, and we then use that to decide what values to pull out, rather than opening up the entire response body to the user to pull values out of.

Member Author:

The reason I think we may want to offer some sort of API here is that, although lots of runtimes offer OpenAI-style completion APIs, in the ones I've looked at there are some small differences (i.e. where in the JSON response the usage metrics are) which could break the policy. We could have some prebuilt "variants" to make the APIs look nice, though (perhaps starting with one for llama-stack and one for OpenAI-style) - these variants would come with built-in selectors on what attributes to pluck to get our usage metrics (if that makes sense).

Collaborator:

Yeah, like we could for now call out those as the supported options, rather than jumping straight to giving the power to the user, which we might regret and not be able to take back. That said, we are talking about alpha APIs, so it's easier to take back than in other places.

Member Author:

that makes sense. I suppose internally we'll probably use selectors, and then if we decide later, we can expose those to end users

@maleck13 (Collaborator) left a review comment:

This is a really great start to some cool features. I think we still need to nail down how we want to do the request filtering (seems to be leaning towards the WASM shim) and also whether we need to expose certain options to the user or not

@jasonmadigan (Member Author) commented:

One other potential policy which may emerge here, depending on how this PoC progresses, is a `SemanticCachingPolicy` for short-circuiting (well, partially - we'd still need to do the embedding) expensive LLM calls if we see similar prompts.
