Replies: 7 comments 17 replies
-
We've spent the day trying all sorts of things, but we still have issues. We bumped up backpressure, and that blew up our database (we ran out of connections). We tried fewer workers, more workers, and all sorts of in-between configurations. I think the hardest part is not knowing whether Granian is contributing to the latency at this point or whether we're just wasting our time tuning the wrong things. Are there any metrics or logs (debug logs?) that would help us determine if the I/O threads/backpressure are set too low? My next plan is to get a continuous profiler running so I can at least see what's going on under the hood.
-
I can't think of anything specific which could increase latency 10x compared to Gunicorn (in fact, every benchmark suggests otherwise). To me it sounds like you're overloading the database, so instead of increasing backpressure I would go the other way around and pick very low numbers (the 4-16 range). If that doesn't help, the other possible route would be to limit the blocking threads to a low number as well, but I would try a low backpressure first.
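A minimal sketch of what trying that advice might look like with the embedded server API, assuming your Granian version accepts `backpressure` and `blocking_threads` keyword arguments (the CLI exposes similar options; the target below is hypothetical):

```python
from granian import Granian

server = Granian(
    target="myproject.wsgi:application",  # hypothetical WSGI target
    interface="wsgi",
    workers=4,
    backpressure=8,       # deliberately low, in the suggested 4-16 range
    blocking_threads=2,   # assumption: mirrors the --blocking-threads CLI option
)
server.serve()
```

Start low, measure, and only raise the values if the database keeps up.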
-
I stumbled upon a similar issue while load-testing my app, and found that the load is not evenly distributed among workers - more workers and higher […]

My test:

```python
import asyncio
import os

from granian import Granian

requests = 0


# We need atomic writes to keep logging nice
def log(*args, **kwargs): os.write(1, (" ".join(args) + "\n").encode())


async def app(scope, receive, send):
    global requests

    if scope['type'] == 'lifespan':
        while True:
            message = await receive()
            if message['type'] == 'lifespan.startup':
                log("Starting up...")
                await send({'type': 'lifespan.startup.complete'})
            elif message['type'] == 'lifespan.shutdown':
                log(f"Shutting down ({requests} requests processed)...")
                await send({'type': 'lifespan.shutdown.complete'})
                return
            else:
                raise RuntimeError("Unexpected type of message")

    assert scope['type'] == 'http'

    # Process request (kind of)
    requests += 1

    # Simulate waiting for something asynchronously
    await asyncio.sleep(0.01)

    # Simulate a bit of CPU load (adjust to taste)
    s = 0
    for n in range(10000):
        s += n

    try:
        await send({
            'type': 'http.response.start',
            'status': 200,
        })
        await send({
            'type': 'http.response.body',
            'body': b'Are we there yet?\n',
        })
    except BaseException as e:
        log(f"Processing request failed on send: {e!r}")
        raise


if __name__ == "__main__":
    log("Running manager process")
    manager = Granian(
        target=__file__,
        interface='asgi',
        workers=4,
        respawn_failed_workers=True,
        #backpressure=10,
        respawn_interval=1,
    )
    log(f"Starting manager: {manager}")
    manager.serve()
```

I test it with h2load: […]

So at least one worker is getting minimum requests while another one is also starving a bit. With 8 workers things get worse: […]

We have only 3 which are handling most requests while others are doing much less. And now with 16 workers: […]

Results are consistent across multiple runs. The test system is idling during the tests (i.e. loaded only by the test itself). I would expect that requests are sent in round-robin fashion to every worker which is serving less than `backpressure` […]. Lowering […]
-
@gi0baro Sure, I didn't expect that […]:

```python
while True:
    if active_requests < backpressure:
        await accept_request()
    else:
        await wait_for_some_requests_to_finish()
```

However, I have an impression from your comment that every worker is doing its own […]
-
Now I am a bit puzzled. In pure C, with a rudimentary HTTP simulator that just reads the request and returns a static response, using bind()/listen() and epoll() for event handling in each forked worker, including a backpressure simulation (ignoring connection events when the concurrency limit is reached), I get even load distribution regardless of the backlog and the client's concurrency - so this does not look like a kernel issue. If bind()/listen() is done once (i.e. the listen fd is shared among workers) then I get results similar to those reported above - however, in your code I see that each worker has its own listener (I am not that familiar with tokio though). But indeed, uvicorn also suffers from this problem - so it is not Granian-specific either.
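Not the original C code, but a rough Python rendition of the two setups described above (per-worker listeners vs. one listen fd shared after fork()), with epoll() replaced by a plain blocking accept() for brevity; it assumes a platform that exposes SO_REUSEPORT, such as Linux or FreeBSD:

```python
import os
import socket

ADDR = ("127.0.0.1", 8000)
WORKERS = 4
PER_WORKER_LISTENERS = True  # flip to False to share one listen fd across workers


def make_listener():
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_REUSEPORT lets every worker bind() the same address, so the kernel
    # spreads incoming connections across the per-worker listeners.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(ADDR)
    sock.listen(128)
    return sock


def worker(shared_sock):
    sock = shared_sock if shared_sock is not None else make_listener()
    while True:
        conn, _ = sock.accept()
        conn.sendall(
            b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nok"
        )
        conn.close()


if __name__ == "__main__":
    # Shared-fd variant: bind()/listen() once in the parent; children inherit
    # the fd after fork() and race on accept(), which tends to skew the load.
    shared = None if PER_WORKER_LISTENERS else make_listener()
    for _ in range(WORKERS):
        if os.fork() == 0:
            worker(shared)
            os._exit(0)
    for _ in range(WORKERS):
        os.wait()  # workers run until interrupted
```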
-
@apenney since v2.2.2 Granian is properly distributing load among all workers on Linux and FreeBSD. If you are on one of those systems, please try your test again and tell us if this solves your issue. Also a general recommendation (based on tests): keep `backpressure` as low as possible (the default based on backlog and number of workers is ok). One other thing you could try is to disable keep-alives (unless you really need them) - this also helps to distribute load evenly among workers, thus reducing latency.
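If you are embedding Granian from Python, disabling keep-alives might look roughly like the sketch below; treat `HTTP1Settings` and its `keep_alive` field as an assumption about recent Granian versions and verify against your installed release (the CLI should expose an equivalent switch):

```python
from granian import Granian
from granian.http import HTTP1Settings  # assumption: available in recent releases

server = Granian(
    target="myproject.wsgi:application",  # hypothetical target
    interface="wsgi",
    workers=4,
    # Assumption: keep_alive is a field of HTTP1Settings; check your version.
    http1_settings=HTTP1Settings(keep_alive=False),
)
server.serve()
```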
-
@aldem could you elaborate on this: "keep backpressure as low as possible (the default based on backlog and number of workers is ok)"? Any example?
-
Hi!
We're looking at switching over from Gunicorn to Granian, but we ran into a weird performance issue. We're running Django 4.x, and in our development environment baseline latency has gone up from 6ms to 60ms.
This was weird, so I started digging into our APM traces and found that one of our middlewares (something custom) has gone from an average of 6ms to 60ms.
It's got some unexciting code that looks like: […]
At this point I couldn't find any obvious reason for the issue, so I started to dig into blocking threads and the like, trying to find anything to tune. From my understanding, for WSGI the best option for us is: […]
At that point I would have had backpressure=128 by the default calculation, so I then bumped this to 512 to see if it would help, but nothing changed about the latency profile. (We don't use a DB connection pool, we just let Django do its thing, so I couldn't scale them based on that.)
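For what it's worth, the "default calculation" referenced here appears to be the backlog spread across the workers (per the "default based on backlog and number of workers" remark elsewhere in the thread); the numbers below are purely illustrative:

```python
# Assumption: default backpressure is roughly backlog // workers.
backlog = 1024   # hypothetical listen backlog
workers = 8      # hypothetical worker count

default_backpressure = max(1, backlog // workers)
print(default_backpressure)  # 128, matching the figure mentioned above
```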
Questions: […]
The latency hike was pretty noticeable on the charts, so I'm hesitant to roll this out to environments with more traffic until I better understand what's going on here. Any suggestions would be greatly appreciated!