Setting rate limits with request headers #557

Stephen-X · 2025-04-05T03:42:10Z

Stephen-X
Apr 5, 2025

I originally posted a version of this question in the Envoy Gateway repo: envoyproxy/gateway#5600.

Hi there,

I'm inquiring if there's currently a way to set the global token rate limit dynamically using a request header.

We run a high-traffic service that hosts thousands of users and are working on enhancing some of its features with GenAI. We are currently exploring using Envoy AI Gateway with some customizations (we understand the OAI-compatible embedding API is currently not supported by EAG, so that's something we could develop to start with and perhaps contribute back) as an internal AI gateway to a few of our LLM endpoints. This is the high-level flow:

[Some frontend service that also manages user configs] ---> [Envoy Gateway] ---> [LLM backends]

To support different configs for each user, EAG requires creating separate K8s custom resources, but that would require a lot of work for us to build a dynamic config loading pipeline for EAG. My understanding is I need to set up a separate xDS control plane to load configs dynamically, per Configuration: Dynamic from control plane? That sounds highly complex. We are hoping to simplify development by letting the frontend service control EAG behavior dynamically for each user.

Instead of having a separate AIGatewayRoute for each user, we're trying to see if it's possible for the frontend, which already has user configs loaded, to control EAG with request headers.

For example, say a user set a maximum rate limit of 11 TPS, then our frontend could be sending requests to the backends with the following 3 headers:

x-user-id: xxxxx
x-user-token-limit-requests: 11
x-user-token-limit-unit: Second

And EAG could be comparing the TPS limit against the counter value in Redis to decide if the current request should be throttled.

Since this is not readily available, we are wondering if we could still use the existing rate limit functionality as much as possible with a bit of extra dev work.
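
To make the proposed check concrete, here is a minimal sketch of a fixed-window limiter that reads the limit from the three headers above. It is an illustration only, not existing Envoy or EAG behavior: the `FixedWindowLimiter` class is hypothetical, an in-memory dict stands in for the Redis counter, and the set of accepted unit values is an assumption.

```python
import time


class FixedWindowLimiter:
    """Hypothetical header-driven limiter; a dict stands in for Redis."""

    # Assumed unit names, mirroring the "Second" value used in the example headers.
    UNIT_SECONDS = {"Second": 1, "Minute": 60, "Hour": 3600}

    def __init__(self, clock=time.time):
        self._clock = clock
        # (user_id, window_index) -> cost consumed in that window.
        # Old windows are never evicted here; Redis would use a key TTL instead.
        self._counters = {}

    def allow(self, headers, cost=1):
        """Return True if the request fits in the user's current window."""
        user_id = headers["x-user-id"]
        limit = int(headers["x-user-token-limit-requests"])
        window = self.UNIT_SECONDS[headers["x-user-token-limit-unit"]]
        key = (user_id, int(self._clock() // window))
        used = self._counters.get(key, 0)
        if used + cost > limit:
            return False  # throttle: the budget for this window is spent
        self._counters[key] = used + cost
        return True
```

Because the limit is read from each request, the counter key stays stable per user while the budget it is compared against can differ from one request to the next.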

mathetake · 2025-04-09T07:10:12Z

mathetake
Apr 9, 2025
Maintainer

I am OOO until next week. Let me come back to this later (or ping me in case I forget about this thread)...

0 replies

mathetake · 2025-04-14T18:11:35Z

mathetake
Apr 14, 2025
Maintainer

@Stephen-X so first of all, the actual rate limit configuration is done through EG's BackendTrafficPolicy, as you can see in the example. Envoy AI Gateway's role regarding rate limiting is to configure how the cost of each request is calculated, through llmRequestCosts.

a user set a maximum rate limit of 11 TPS

Now the question I have for you is: why do you let users set their own request rate limit, and how do you enforce the rules? If each user (I assume they are distinguished by, say, a "user-id" header) can have any dynamic rate limit value, I don't think any existing rate limit mechanism works for such a case; that is not a limitation of Envoy Gateway specifically. How do you ensure that each user sends exactly the same x-user-token-limit-requests: 11 header every time? Otherwise, how is Envoy or the rate limit service expected to calculate the rate limit bucket? I don't think that's possible even when you have full control over xDS.
tl;dr: I am not sure exactly what kind of problem you are trying to solve, given that any user can modify the limit anytime they want. IMO the limit should not be part of the request headers but rather static configuration. If the frontend service is the one that knows the limit per user and knows how to associate a "user-id" with a token rate limit, then the frontend service should be able to generate the static BackendTrafficPolicy configuration using the "Exact" match type on the user-id header value.
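
As a rough illustration of that last suggestion, the sketch below builds a BackendTrafficPolicy manifest as a Python dict from a per-user limit table. The function name, policy name, and route name are hypothetical, and the field names (`clientSelectors`, `headers`, `limit`, `targetRefs`) follow the Envoy Gateway global rate limit examples as best I recall them, so check them against the API reference for your EG version. Token-based cost settings are left out.

```python
import json


def build_backend_traffic_policy(name, route_name, user_limits):
    """Build a manifest dict with one "Exact" x-user-id rule per user.

    user_limits maps user id -> (requests, unit), e.g. {"alice": (11, "Second")}.
    """
    rules = [
        {
            "clientSelectors": [
                {"headers": [{"name": "x-user-id", "type": "Exact", "value": user_id}]}
            ],
            "limit": {"requests": requests, "unit": unit},
        }
        for user_id, (requests, unit) in sorted(user_limits.items())
    ]
    return {
        "apiVersion": "gateway.envoyproxy.io/v1alpha1",  # assumed API version
        "kind": "BackendTrafficPolicy",
        "metadata": {"name": name},
        "spec": {
            "targetRefs": [
                {
                    "group": "gateway.networking.k8s.io",
                    "kind": "HTTPRoute",
                    "name": route_name,
                }
            ],
            "rateLimit": {"type": "Global", "global": {"rules": rules}},
        },
    }


if __name__ == "__main__":
    policy = build_backend_traffic_policy(
        "per-user-limits", "llm-route", {"alice": (11, "Second")}
    )
    print(json.dumps(policy, indent=2))
```

The frontend would regenerate and apply this manifest whenever a user's configured limit changes, rather than sending the limit with every request.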

5 replies

Stephen-X Apr 14, 2025
Author

@mathetake In the scenario I described above, Envoy AI Gateway is used internally and the frontend manages all the user configurations, so we are exploring whether we could make the internal gateway stateless in terms of configuration. I think this architecture would also make sense in general when thousands of users are involved? I'm not sure what it would look like to maintain thousands of BackendTrafficPolicy resources and give each user limited access to their own policy.

In a hypothetical flow, the frontend would first strip any x-user-token-limit-requests request headers a user may provide (to prevent malicious clients from hijacking the limit), then insert the x-user-token-limit-requests header based on the user config and send the request downstream to EAG for rate limiting.
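
A minimal sketch of that strip-then-insert step, assuming headers are held in a plain dict. The function name is hypothetical, and stripping all three control headers (not only x-user-token-limit-requests) is my assumption.

```python
# Headers that only the frontend is allowed to set (assumed to be all three).
CONTROL_HEADERS = (
    "x-user-id",
    "x-user-token-limit-requests",
    "x-user-token-limit-unit",
)


def apply_user_limit_headers(client_headers, user_id, limit, unit):
    """Drop client-supplied control headers, then set them from the user config."""
    # HTTP header names are case-insensitive, so compare lowercased names.
    forwarded = {
        name: value
        for name, value in client_headers.items()
        if name.lower() not in CONTROL_HEADERS
    }
    forwarded["x-user-id"] = user_id
    forwarded["x-user-token-limit-requests"] = str(limit)
    forwarded["x-user-token-limit-unit"] = unit
    return forwarded
```

For example, a client that sends `X-User-Token-Limit-Requests: 999999` would have that value discarded and replaced with the configured limit before the request reaches EAG.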

the actual rate limit configuration is done through EG's BackendTrafficPolicy, as you can see in the example. Envoy AI Gateway's role regarding rate limiting is to configure how the cost of each request is calculated, through llmRequestCosts.

I see, I suppose header-based configurations are not possible unless Envoy Gateway provides such support? If we are to fork EG and implement such a feature ourselves, I wonder if you could provide some pointers as to how we could start.

mathetake Apr 14, 2025
Maintainer

I suppose header-based configurations are not possible unless Envoy Gateway provides such support? If we are to fork EG and implement such a feature ourselves, I wonder if you could provide some pointers as to how we could start.

no, it's not possible with Envoy in general; it's not a matter of the EG or control plane implementation. The rate limit budget must be static (xDS) Envoy configuration, not per request.

Stephen-X Apr 14, 2025
Author

no, it's not possible with Envoy in general; it's not a matter of the EG or control plane implementation. The rate limit budget must be static (xDS) Envoy configuration, not per request.

I see. I saw Configuration: Dynamic from control plane and thought it could be dynamic, but it looks like that is essentially just storing static configurations externally. And EAG still relies on https://github.com/envoyproxy/ratelimit/blob/main/README.md#xds-management-server-based-configuration-loading to do the rate limiting.

Thank you for the help so far!

mathetake Apr 14, 2025
Maintainer

Yeah, having said that, adding support for getting the limit from a header seems like a valid feature request for envoyproxy/envoy. Opening an issue at https://github.com/envoyproxy/envoy/issues might be helpful anyway. I believe it shouldn't be hard implementation-wise: basically it means changing the code around https://github.com/envoyproxy/envoy/blob/8283565cffc7b713ef1b3a8b79c285c269e15db3/source/extensions/filters/common/ratelimit/ratelimit_impl.cc#L54 to use a value from the header instead of descriptor.limit_.value(). But it would be better anyway to achieve your goal without making any change upstream :)

missBerg Apr 15, 2025
Maintainer

@Stephen-X are you saying each user has a unique rate limit? Completely different rate limit number?

Because simple rate limiting per user only requires one policy.

Thousands of users != Thousands of Policies

The user is the bucket
The rate limit is the policy

Imagine a case where you have 1, 2, 3, 4, ..., 100 as the possible rate limit values per, say, minute.

You would have 100 policies, plus conditions for which policy to enforce for a given endpoint and user.
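
One way to picture this is the sketch below, which generates one rule per possible limit value. It assumes the frontend tags each request with a hypothetical `x-rate-limit-tier` header carrying the user's limit, and that an "Exact" match on that header can be combined with a "Distinct" match on x-user-id so that each user gets a separate bucket. Both the header name and that combination are assumptions to verify against the Envoy Gateway docs; the rules could live in one policy or be split across several.

```python
def build_tier_rules(limits, unit="Minute"):
    """Return one rate limit rule per allowed limit value.

    The tier header selects the rule (the policy); the Distinct user id
    header makes each user a separate bucket within that rule.
    """
    return [
        {
            "clientSelectors": [
                {
                    "headers": [
                        # Hypothetical header set by the frontend from the user config.
                        {"name": "x-rate-limit-tier", "type": "Exact", "value": str(limit)},
                        {"name": "x-user-id", "type": "Distinct"},
                    ]
                }
            ],
            "limit": {"requests": limit, "unit": unit},
        }
        for limit in limits
    ]


# 100 possible limits -> 100 rules, regardless of how many users exist.
rules = build_tier_rules(range(1, 101))
```

With this shape the number of rules grows with the number of distinct limit values, not with the number of users.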

mathetake · 2025-04-14T20:00:15Z

mathetake
Apr 14, 2025
Maintainer

btw the embedding endpoint support is definitely valuable so I would love to see that happening here!

2 replies

Stephen-X Apr 14, 2025
Author

No promises yet but will consider :) Thanks again for the info!

From our conversation, EAG doesn't currently seem to fit our architecture (without us forking Envoy), but I'm still interested in the project for providing at least a global token limit policy. #90 is also of interest: since we are dealing with high-volume traffic, the external processor approach might not be ideal.

mathetake Apr 14, 2025
Maintainer

Cool, thank you for letting me know your current thinking! We hang out in the #envoy-ai-gateway channel of the Envoy Slack (invite link: https://communityinviter.com/apps/envoyproxy/envoy), so feel free to join and chat casually there whenever you want help etc.!
