Poisoned pod, extremely slow, weird thread counts #4448

@KadekM

Description

Our pods (hosted in Kubernetes) sometimes get poisoned, and 100% of their requests become incredibly slow (think 5-25 s instead of <1 s).
Responses are slow even with zero traffic (essentially only a test curl).
A non-poisoned pod from the same deployment works just fine (and shows normal numbers in JMX etc.).

Interesting JMX readings from a poisoned pod:

ComputePoolSampler.ActiveThreadCount is 65534
ComputePoolSampler.WorkerThreadCount is 2
ComputePoolSampler.SearchingThreadCount is 65535
ComputePoolSampler.BlockedWorkerThreadCount is 0
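For what it's worth, 65535 and 65534 are suspicious values: they are exactly what a counter packed into the low 16 bits of an `Int` reads back as after being decremented one and two steps below zero. This is only a guess about where the numbers come from, but the arithmetic is easy to check:

```scala
object SearchingUnderflow {
  def main(args: Array[String]): Unit = {
    // Hypothetical illustration: if a thread counter lives in the low
    // 16 bits of a packed Int, decrementing it past zero reads back as
    // 65535, and one more decrement as 65534, once masked.
    val afterOne = (0 - 1) & 0xffff
    val afterTwo = (0 - 2) & 0xffff
    println(s"$afterOne $afterTwo") // 65535 65534
  }
}
```

So the pool may believe "all" of its (two) workers are searching/active while the real counts are negative, which could explain why work never gets scheduled promptly.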

Weirdly, a thread dump shows fewer than 100 threads: threads_dump.txt

Scala 2.13.6
cats-effect 3.6.2
doobie 1.0.0-RC9
sttp client with Armeria backend 3.11.0
tapir + http4s 1.11.36

JVM: Eclipse Adoptium 17.0.15 (also observed on 21)

JVM options:
 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.port=1099 -Dcom.sun.management.jmxremote.rmi.port=1099 -Djava.rmi.server.hostname=127.0.0.1 -XX:+UseG1GC -XX:G1PeriodicGCInterval=30000 -XX:+PrintCommandLineFlags -Dsun.net.inetaddr.ttl=30 -Dpidfile.path=/dev/null -XX:+CrashOnOutOfMemoryError -XX:MaxMetaspaceSize=320m -XX:ReservedCodeCacheSize=64m -XX:CompressedClassSpaceSize=64m -XX:InitialHeapSize=519m -XX:MaxHeapSize=1038m
-XX:CompressedClassSpaceSize=67108864 -XX:ConcGCThreads=1 -XX:+CrashOnOutOfMemoryError -XX:G1ConcRefinementThreads=2 -XX:G1PeriodicGCInterval=30000 -XX:GCDrainStackTargetSize=64 -XX:InitialHeapSize=544210944 -XX:+ManagementServer -XX:MarkStackSize=4194304 -XX:MaxHeapSize=1088421888 -XX:MaxMetaspaceSize=335544320 -XX:MinHeapSize=6815736 -XX:+PrintCommandLineFlags -XX:ReservedCodeCacheSize=67108864 -XX:-THPStackMitigation -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseG1GC

If we kill the pod, the new one works just fine, but roughly once a day we see this happen, eventually on multiple pods across multiple services. The only commonality is cats-effect 3.6.2, latest Scala 2, tapir, sttp client, and doobie RC9 (though not necessarily the same client/server backends).

Kubernetes resource settings:

limits:
  cpu: 1100m
  memory: 1200Mi
requests:
  cpu: 600m
  memory: 1200Mi
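One possibly relevant detail: with a cgroup CPU limit of 1100m, recent container-aware JVMs typically round the quota up and report 2 available processors, which would size the cats-effect compute pool at 2 workers, consistent with `WorkerThreadCount` being 2 above. This is an inference, not something I've verified on these exact pods; it's trivial to check from inside a pod:

```scala
object CpuQuota {
  def main(args: Array[String]): Unit = {
    // Under the 1100m limit above this is expected (not guaranteed) to
    // print 2; the default compute pool is sized from this value.
    println(Runtime.getRuntime.availableProcessors())
  }
}
```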

Some pods never get poisoned; we haven't yet observed any pattern except that it happens semi-regularly, and once a pod is poisoned it doesn't recover.
