We've experienced numerous segfaults in production that point to this specific line of code
We can reliably reproduce this by triggering Sidekiq to pause and unpause a queue, which causes it to receive a message on a Pub/Sub channel.
This happens on a handful of the dozens of containers we run.
I was able to get a coredump from one of the containers and here's the backtrace:
(gdb) bt full
#0 0x00007f8c6d51cebc in ?? () from buildkite-hiredis-segfault/eloquent_ride/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1 0x00007f8c6d4cdfb2 in raise () from buildkite-hiredis-segfault/eloquent_ride/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#2 0x00007f8c6da9c8bf in ruby_default_signal (sig=<optimized out>) at signal.c:422
No locals.
#3 0x00007f8c6d8864b8 in rb_bug_for_fatal_signal (default_sighandler=0x0, sig=sig@entry=11, ctx=ctx@entry=0x7f8c11fd9880, fmt=fmt@entry=0x7f8c6dcac4d5 "Segmentation fault at %p") at error.c:1069
file = <optimized out>
line = 92
#4 0x00007f8c6da9b84b in sigsegv (sig=11, info=0x7f8c11fd99b0, ctx=0x7f8c11fd9880) at signal.c:926
No locals.
#5 <signal handler called>
No symbol table info available.
#6 0x00007f8c48a792bd in reply_append (value=<REDACTED>, task=0x7f8c12223690) at hiredis_connection.c:143
state = 0x7f8c0fb5d300
task_index = <optimized out>
state = <optimized out>
task_index = <optimized out>
parent = <optimized out>
key = <optimized out>
rb_gc_guarded_ptr = <optimized out>
rb_gc_guarded_ptr = <optimized out>
rb_gc_guarded_ptr = <optimized out>
#7 reply_create_array (task=0x7f8c12223690, elements=<optimized out>) at hiredis_connection.c:212
value = <REDACTED>
@byroot I don't think I will be able to: Sidekiq Pro is commercially licensed, and there are multiple threads that seem to interact in some unknown way to trigger the bug.
I'm trying to replicate the issue with ASAN, as we did in #208.
Somehow task_index is a null pointer, which shouldn't be possible given the code path?