Description
Our pods (hosted in Kubernetes) sometimes get poisoned, and 100% of their requests become extremely slow (5-25 s instead of <1 s).
Responses are slow even with zero traffic (essentially only a test curl).
A non-poisoned pod from the same deployment works fine (and shows normal numbers in JMX, etc.).
Interesting JMX values from a poisoned pod:
ComputePoolSampler.ActiveThreadCount is 65534
ComputePoolSampler.WorkerThreadCount is 2
ComputePoolSampler.SearchingThreadCount is 65535
ComputePoolSampler.BlockedWorkerThreadCount is 0
Oddly, the thread dump shows fewer than 100 threads: threads_dump.txt
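The two outlier values sit at the very top of the unsigned 16-bit range (65535 = 2^16 - 1), which is what a small negative number looks like when read as an unsigned 16-bit counter. The following is only a sketch of that arithmetic, not a diagnosis: it assumes, without confirmation from this report, that the pool keeps these counters in 16-bit fields. The class and method names are hypothetical.

```java
public class CounterWraparound {
    // Interpret the low 16 bits of a value as an unsigned counter.
    // (Hypothetical helper, used only to illustrate the arithmetic.)
    static int asUnsigned16(int value) {
        return value & 0xFFFF;
    }

    public static void main(String[] args) {
        // A counter decremented once too often would hold -1;
        // read as unsigned 16-bit, that is 65535.
        System.out.println(asUnsigned16(-1)); // prints 65535
        // Likewise -2 reads as 65534.
        System.out.println(asUnsigned16(-2)); // prints 65534
    }
}
```

If that reading is right, the reported counts would not correspond to real threads, which would be consistent with the thread dump showing fewer than 100 threads.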
Scala 2.13.6
cats-effect 3.6.2
doobie 1.0.0-RC9
sttp client with Armeria backend 3.11.0
tapir + http4s 1.11.36
JVM 17.0.15 Eclipse Adoptium (also observed on 21)
JVM options:
-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.port=1099 -Dcom.sun.management.jmxremote.rmi.port=1099 -Djava.rmi.server.hostname=127.0.0.1 -XX:+UseG1GC -XX:G1PeriodicGCInterval=30000 -XX:+PrintCommandLineFlags -Dsun.net.inetaddr.ttl=30 -Dpidfile.path=/dev/null -XX:+CrashOnOutOfMemoryError -XX:MaxMetaspaceSize=320m -XX:ReservedCodeCacheSize=64m -XX:CompressedClassSpaceSize=64m -XX:InitialHeapSize=519m -XX:MaxHeapSize=1038m
-XX:CompressedClassSpaceSize=67108864 -XX:ConcGCThreads=1 -XX:+CrashOnOutOfMemoryError -XX:G1ConcRefinementThreads=2 -XX:G1PeriodicGCInterval=30000 -XX:GCDrainStackTargetSize=64 -XX:InitialHeapSize=544210944 -XX:+ManagementServer -XX:MarkStackSize=4194304 -XX:MaxHeapSize=1088421888 -XX:MaxMetaspaceSize=335544320 -XX:MinHeapSize=6815736 -XX:+PrintCommandLineFlags -XX:ReservedCodeCacheSize=67108864 -XX:-THPStackMitigation -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseG1GC
If we kill the pod, the new one works fine, but roughly once a day this eventually happens on multiple pods across multiple services. The only commonalities are cats-effect 3.6.2, the latest Scala 2, tapir, sttp client, and doobie RC9 (though not necessarily the same client/server backends).
k8s pod resources:
limits:
  cpu: 1100m
  memory: 1200Mi
requests:
  cpu: 600m
  memory: 1200Mi
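For context on the WorkerThreadCount of 2 reported above: a rough sketch of how a 1100m CPU limit could map to two worker threads. It assumes that a container-aware JVM rounds the CPU limit up to whole processors and that the compute pool is sized from `Runtime.availableProcessors()`; neither is confirmed in this report, and the class and method names are hypothetical.

```java
public class CpuLimit {
    // Round a Kubernetes millicore limit up to whole CPUs, with a floor of 1.
    // (Hypothetical helper; it mirrors the assumed rounding, not a JVM API.)
    static int wholeCpus(int millicores) {
        return Math.max(1, (millicores + 999) / 1000);
    }

    public static void main(String[] args) {
        // The limit from this report: 1100m rounds up to 2.
        System.out.println(wholeCpus(1100));
        // What the JVM actually reports inside the container, for comparison.
        System.out.println(Runtime.getRuntime().availableProcessors());
    }
}
```

Running the second line inside an affected pod would show whether the JVM really sees two processors there.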
Some pods never get poisoned. We haven't observed any pattern yet, except that it happens semi-regularly and that a poisoned pod doesn't recover.