
Caddy crashes under load, with huge number of mutex locks in cache handler #114

Open
alistairjevans opened this issue Dec 6, 2024 · 2 comments

Comments

@alistairjevans

Under reasonably small load (~250 req/sec) when using the cache-handler (we're using the go-redis storage, but I've seen this happen with the default one as well), we get a fatal error: concurrent map iteration and map write in Caddy. This happens reasonably frequently (maybe once a day) and requires a restart.

 fatal error: concurrent map iteration and map write
 
 goroutine 27695297 [running]:
 net/http.Header.Clone(...)
 	net/http/header.go:101
 net/http.(*response).WriteHeader(0xc920c73c00, 0x1f4)
 	net/http/server.go:1231 +0x1e7
 github.com/caddyserver/caddy/v2/modules/caddyhttp.(*responseRecorder).WriteHeader(0x1ef02e0?, 0x1a70889?)
 	github.com/caddyserver/caddy/[email protected]/modules/caddyhttp/responsewriter.go:167 +0xb6
 github.com/caddyserver/caddy/v2/modules/caddyhttp.(*Server).ServeHTTP(0xc00089a008, {0x1f00d70, 0xc920c73c00}, 0xcf30e57040)
 	github.com/caddyserver/caddy/[email protected]/modules/caddyhttp/server.go:415 +0x1535
 net/http.serverHandler.ServeHTTP({0xcfd3b14750?}, {0x1f00d70?, 0xc920c73c00?}, 0x6?)
 	net/http/server.go:3210 +0x8e
 net/http.(*conn).serve(0xcef56bf4d0, {0x1f04538, 0x1070ae8cba0})
 	net/http/server.go:2092 +0x5d0
 created by net/http.(*Server).Serve in goroutine 181
 	net/http/server.go:3360 +0x485
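
For context, here is a minimal standalone Go sketch (not Caddy or souin code, just an illustration of this class of crash): one goroutine keeps writing to an http.Header, which is a map, while another clones it; Header.Clone iterates the map, which is exactly what net/http does inside WriteHeader in the trace above.

package main

import (
	"net/http"
	"strconv"
)

func main() {
	h := http.Header{}

	// Writer goroutine: mutates the header map continuously.
	go func() {
		for i := 0; ; i++ {
			h.Set("X-Demo", strconv.Itoa(i))
		}
	}()

	// Reader: Header.Clone iterates the map, just like
	// net/http.(*response).WriteHeader does. With the writer above,
	// the runtime soon aborts with
	// "fatal error: concurrent map iteration and map write".
	for {
		_ = h.Clone()
	}
}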

This is accompanied by about 1.9 million goroutines in the sync.Mutex.Lock state, implying (I guess) that a huge number of goroutines are waiting on a mutex.

goroutine 21446179 [sync.Mutex.Lock, 92 minutes]:
sync.runtime_SemacquireMutex(0xc918f18cb0?, 0xb8?, 0xc001a52708?)
	runtime/sema.go:95 +0x25
sync.(*Mutex).lockSlow(0xc0008fc640)
	sync/mutex.go:173 +0x15d
sync.(*Mutex).Lock(...)
	sync/mutex.go:92
github.com/darkweak/souin/pkg/surrogate/providers.(*baseStorage).storeTag(0xc000a80360, {0x0, 0x0}, {0xc61aaf6ac0, 0x38}, 0xcb96172320)
	github.com/darkweak/[email protected]/pkg/surrogate/providers/common.go:168 +0xad
github.com/darkweak/souin/pkg/surrogate/providers.(*baseStorage).Store(0xc000a80360, 0xc918f18ec0?, {0xc428c81400?, 0x90?}, {0x19dce40?, 0xcc0cfff7
	github.com/darkweak/[email protected]/pkg/surrogate/providers/common.go:232 +0x27f
github.com/darkweak/souin/pkg/middleware.(*SouinBaseHandler).Store.func2({{0x0, 0x0}, 0x12d, {0x0, 0x0}, 0x0, 0x0, 0xca7ca1ea50, {0x1efafc8, 0xcd3d
	github.com/darkweak/[email protected]/pkg/middleware/middleware.go:376 +0xcd
created by github.com/darkweak/souin/pkg/middleware.(*SouinBaseHandler).Store in goroutine 21446047
	github.com/darkweak/[email protected]/pkg/middleware/middleware.go:375 +0x2891
2024-12-06T08:04:16.889267314Z status_proxy_caddy_caddy.0.t7ydd1vd5hya@statuspage-1    | 
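
To illustrate what the dump above suggests (hypothetical names, not the actual souin implementation): a single sync.Mutex guarding the surrogate-tag storage serializes every goroutine that stores a response, so if the work done while holding the lock is slow under load, goroutines queue up in sync.Mutex.Lock exactly like the entries above.

package main

import (
	"sync"
	"time"
)

type tagStore struct {
	mu   sync.Mutex
	tags map[string][]string
}

func (s *tagStore) storeTag(tag, key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	// Anything slow here (for example a round-trip to the storage backend)
	// blocks every other goroutine trying to store a tag for its response.
	time.Sleep(10 * time.Millisecond)
	s.tags[tag] = append(s.tags[tag], key)
}

func main() {
	s := &tagStore{tags: map[string][]string{}}
	var wg sync.WaitGroup
	for i := 0; i < 250; i++ { // roughly one goroutine per cached response at ~250 req/sec
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.storeTag("surrogate-key", "cache-entry")
		}()
	}
	wg.Wait()
}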

My theory is that the crash happens when some pool is finally exhausted.

Attached is a log file containing the first 1,000 lines; the full dump is many gigabytes (I can provide it if you're interested; it's all different goroutines waiting at the same point in the code).

limited.txt

Build:

FROM caddy:2.8.4-builder AS builder

RUN xcaddy build \
    --with github.com/pberkel/[email protected] \
    --with github.com/caddyserver/[email protected] \
    --with github.com/darkweak/storages/go-redis/caddy \
    --with github.com/aksdb/[email protected]

Caddyfile, with some values redacted:

(ip_block) {
    @block {             
        remote_ip REDACTED 
    }
    respond @block "Access denied" 403
}

(host_block) {
    @host_block {
        host REDACTED
    }
    respond @host_block "Rate limited" 429
}

{
    on_demand_tls {
        ask https://REDACTED
    }

    storage redis {
        host           "{$REDIS_HOST}"
        username       "{$REDIS_USER}"
        password       "{$REDIS_PASSWORD}"
    }

    servers {
        metrics
    }

    cache {
        # how our cache shows in the response header; default is Souin.
        cache_name sc

        redis {
            configuration {
                Addrs "{$REDIS_CACHE_HOST}:6379"
                User "{$REDIS_CACHE_USER}"
                Password "{$REDIS_CACHE_PASSWORD}"
                DB 0
            }
        }
    }

    order cgi before respond
}

# Catch-all for any domain
:443 {
    log

    tls {
        # we do some funky certificate selection stuff with a local script, fallback to the default
        get_certificate http http://localhost:4434/cert
        on_demand
    }

    import ip_block
    import host_block

    cache {
        key {
            hide
            headers Accept-Language
        }
        # ignore any cache-control indicators from the client
        mode bypass_request
    }

    # Reverse proxy to server.betteruptime.com
    reverse_proxy https://REDACTED {
        header_up Host {upstream_hostport}
        header_up X-Real-IP {http.request.remote.host}
    }
}

http://:9190 {
    metrics /metrics
}

http://localhost:4434 {
    cache {
        stale 5m
        ttl 5m
    }
    cgi /cert /etc/caddy/get_fallback_cert.sh
}
@alistairjevans
Author

@darkweak, do you have any thoughts on this issue, or is there more information you require? The cache-handler is entirely unusable in production because of this issue, and I don't think it's related to a specific storage provider.

@darkweak
Collaborator

darkweak commented Jan 2, 2025

Hello, I have to write a patch that doesn't send the headers if it's an asynchronous revalidation (I think it tries to update the status code when the async revalidation is done).
You can try the latest commit of this PR, darkweak/souin#589, because it contains some patches related to the revalidation.
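
Roughly the idea, as a hypothetical sketch (not the actual patch, and the names are made up): when the revalidation is asynchronous, only refresh the cached entry and never touch the original http.ResponseWriter, since that response has already completed on another goroutine.

package cache

import "net/http"

type cacheHandler struct {
	store func(*http.Response) // persists the upstream response in the cache
}

func (h *cacheHandler) finishRevalidation(rw http.ResponseWriter, res *http.Response, async bool) {
	h.store(res)
	if async {
		// Asynchronous revalidation: the stale response was already sent,
		// so writing headers or a status code here would race with net/http
		// on the same header map (the crash seen in this issue).
		return
	}
	for k, vv := range res.Header {
		for _, v := range vv {
			rw.Header().Add(k, v)
		}
	}
	rw.WriteHeader(res.StatusCode)
}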
