"valgrind: the 'impossible' happened: Killed by fatal signal" on macOS 10.13 #117

Open
EricBrunel opened this issue Mar 6, 2025 · 22 comments
Labels
10.13 bug Something isn't working

Comments

@EricBrunel

Context

I'm basically trying to debug the latest version of tcl/tk (9.0.1), compiled from the source code on an Intel iMac with macOS 10.13. I got valgrind from Homebrew and tried to run it on the tk interpreter (wish9.0).

What went wrong?

When I try to run valgrind on wish9.0, it gets stuck for a few seconds, then prints out the following error message:

==30525== Memcheck, a memory error detector
==30525== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==30525== Using Valgrind-3.24.0.GIT-lbmacos and LibVEX; rerun with -h for copyright info
==30525== Command: /.../TclTk9.0.1/bin/Darwin64/bin/wish9.0
==30525== Parent PID: 30142
==30525== 
--30525-- VALGRIND INTERNAL ERROR: Valgrind received a signal 11 (SIGSEGV) - exiting
--30525-- si_code=1;  Faulting address: 0x7000741D3CF6;  sp: 0x700000aac760

valgrind: the 'impossible' happened:
   Killed by fatal signal

host stacktrace:
==30525==    at 0x2580915EC: ??? (in /usr/local/Cellar/valgrind/HEAD-ef8cbb3/libexec/valgrind/memcheck-amd64-darwin)
==30525==    by 0x25807335E: ??? (in /usr/local/Cellar/valgrind/HEAD-ef8cbb3/libexec/valgrind/memcheck-amd64-darwin)
==30525==    by 0x2580DBC34: ??? (in /usr/local/Cellar/valgrind/HEAD-ef8cbb3/libexec/valgrind/memcheck-amd64-darwin)
==30525==    by 0x2580BE690: ??? (in /usr/local/Cellar/valgrind/HEAD-ef8cbb3/libexec/valgrind/memcheck-amd64-darwin)
==30525==    by 0x2580BDE15: ??? (in /usr/local/Cellar/valgrind/HEAD-ef8cbb3/libexec/valgrind/memcheck-amd64-darwin)
==30525==    by 0x2580BC22C: ??? (in /usr/local/Cellar/valgrind/HEAD-ef8cbb3/libexec/valgrind/memcheck-amd64-darwin)
==30525==    by 0x2580B9D4D: ??? (in /usr/local/Cellar/valgrind/HEAD-ef8cbb3/libexec/valgrind/memcheck-amd64-darwin)
==30525==    by 0x2580CBE16: ??? (in /usr/local/Cellar/valgrind/HEAD-ef8cbb3/libexec/valgrind/memcheck-amd64-darwin)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable syscall unix:197 (lwpid 771)
==30525==    at 0x1000384E2: __mmap (in /usr/lib/dyld)
==30525==    by 0x100037DEB: mmap (in /usr/lib/dyld)
==30525==    by 0x10001C288: ImageLoaderMachO::validateFirstPages(linkedit_data_command const*, int, unsigned char const*, unsigned long, long long, ImageLoader::LinkContext const&) (in /usr/lib/dyld)
==30525==    by 0x100021C25: ImageLoaderMachOCompressed::instantiateFromFile(char const*, int, unsigned char const*, unsigned long, unsigned long long, unsigned long long, stat const&, unsigned int, unsigned int, linkedit_data_command const*, encryption_info_command const*, ImageLoader::LinkContext const&) (in /usr/lib/dyld)
==30525==    by 0x10001B2AB: ImageLoaderMachO::instantiateFromFile(char const*, int, unsigned char const*, unsigned long, unsigned long long, unsigned long long, stat const&, ImageLoader::LinkContext const&) (in /usr/lib/dyld)
==30525==    by 0x10000AB9F: dyld::loadPhase6(int, stat const&, char const*, dyld::LoadContext const&) (in /usr/lib/dyld)
==30525==    by 0x100011129: dyld::loadPhase5(char const*, char const*, dyld::LoadContext const&, unsigned int&, std::__1::vector<char const*, std::__1::allocator<char const*> >*) (in /usr/lib/dyld)
==30525==    by 0x100010D65: dyld::loadPhase4(char const*, char const*, dyld::LoadContext const&, unsigned int&, std::__1::vector<char const*, std::__1::allocator<char const*> >*) (in /usr/lib/dyld)
==30525==    by 0x100010AE5: dyld::loadPhase3(char const*, char const*, dyld::LoadContext const&, unsigned int&, std::__1::vector<char const*, std::__1::allocator<char const*> >*) (in /usr/lib/dyld)
==30525==    by 0x1000102B3: dyld::loadPhase1(char const*, char const*, dyld::LoadContext const&, unsigned int&, std::__1::vector<char const*, std::__1::allocator<char const*> >*) (in /usr/lib/dyld)
==30525==    by 0x10000A7FE: dyld::loadPhase0(char const*, char const*, dyld::LoadContext const&, unsigned int&, std::__1::vector<char const*, std::__1::allocator<char const*> >*) (in /usr/lib/dyld)
==30525==    by 0x10000A4A1: dyld::load(char const*, dyld::LoadContext const&, unsigned int&) (in /usr/lib/dyld)
==30525==    by 0x1000119FE: dyld::libraryLocator(char const*, bool, char const*, ImageLoader::RPathChain const*, unsigned int&) (in /usr/lib/dyld)
==30525==    by 0x10001883D: ImageLoader::recursiveLoadLibraries(ImageLoader::LinkContext const&, bool, ImageLoader::RPathChain const&, char const*) (in /usr/lib/dyld)
==30525==    by 0x1000189F4: ImageLoader::recursiveLoadLibraries(ImageLoader::LinkContext const&, bool, ImageLoader::RPathChain const&, char const*) (in /usr/lib/dyld)
==30525==    by 0x1000189F4: ImageLoader::recursiveLoadLibraries(ImageLoader::LinkContext const&, bool, ImageLoader::RPathChain const&, char const*) (in /usr/lib/dyld)
==30525==    by 0x1000189F4: ImageLoader::recursiveLoadLibraries(ImageLoader::LinkContext const&, bool, ImageLoader::RPathChain const&, char const*) (in /usr/lib/dyld)
==30525==    by 0x1000189F4: ImageLoader::recursiveLoadLibraries(ImageLoader::LinkContext const&, bool, ImageLoader::RPathChain const&, char const*) (in /usr/lib/dyld)
==30525==    by 0x1000189F4: ImageLoader::recursiveLoadLibraries(ImageLoader::LinkContext const&, bool, ImageLoader::RPathChain const&, char const*) (in /usr/lib/dyld)
==30525==    by 0x1000189F4: ImageLoader::recursiveLoadLibraries(ImageLoader::LinkContext const&, bool, ImageLoader::RPathChain const&, char const*) (in /usr/lib/dyld)
==30525==    by 0x1000189F4: ImageLoader::recursiveLoadLibraries(ImageLoader::LinkContext const&, bool, ImageLoader::RPathChain const&, char const*) (in /usr/lib/dyld)
==30525==    by 0x100017BB0: ImageLoader::link(ImageLoader::LinkContext const&, bool, bool, bool, ImageLoader::RPathChain const&, char const*) (in /usr/lib/dyld)
==30525==    by 0x10000BF69: dyld::link(ImageLoader*, bool, bool, ImageLoader::RPathChain const&, unsigned int) (in /usr/lib/dyld)
==30525==    by 0x10000DFE7: dyld::_main(macho_header const*, unsigned long, int, char const**, char const**, char const**, unsigned long*) (in /usr/lib/dyld)
==30525==    by 0x1000083D3: dyldbootstrap::start(macho_header const*, int, char const**, long, macho_header const*, unsigned long*) (in /usr/lib/dyld)
==30525==    by 0x1000081D1: _dyld_start (in /usr/lib/dyld)
client stack range: [0x1040A5000 0x1048A4FFF] client SP: 0x104898DC8
valgrind stack range: [0x7000009AE000 0x700000AADFFF] top usage: 17072 of 1048576


Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org

In the bug report, send all the above text, the valgrind
version, and what OS and version you are using.  Thanks.

valgrind does work on a trivial program (e.g., printf("Hello world")). It also works on the wish interpreter included in macOS: valgrind /usr/bin/wish does not crash with the message above, but actually runs the interpreter.

I built the tested interpreter myself a few days ago, on the same Mac where I am trying to run valgrind. Needless to say, it works without problems when I run it outside of valgrind.

Information

  • macOS architecture (uname -m): x86_64
  • macOS version (sw_vers): 10.13.6
  • Xcode version (xcrun --sdk macosx --show-sdk-version): 10.14
@LouisBrunner LouisBrunner added bug Something isn't working 10.13 labels Mar 6, 2025
@LouisBrunner
Owner

Hi @EricBrunel,

This is going to be a bit challenging to debug as I don't have access to a macOS 10.13 machine. However, here are a few steps we can try to see what's going on in Valgrind.

  • Getting valgrind to include debug info would be preferable; I am not sure why this isn't happening here. Maybe it's worth cloning this repository directly, building it with ./autogen.sh && ./configure && make -j 5 and running it without a system-wide installation: ./vg-in-place OPTIONS YOUR_PATH ARGS
  • You can run valgrind through lldb so the debugger can tell you more about where the crash happened. If you have valgrind built locally, you can use this instead of ./vg-in-place to start a debugging session with LLDB: VALGRIND_LIB=$(pwd)/.in_place VALGRIND_LAUNCHER=$(pwd)/coregrind/valgrind lldb -- ./memcheck/memcheck-amd64-darwin
  • In order to get the maximum of information out of Valgrind internals, you can use a combination of these options and post the result here: -d -d -d -v -v -v -v -v --trace-children=yes --trace-syscalls=yes --trace-malloc=yes --trace-signals=yes --trace-redirs=yes. Heads-up: this will produce a lot of logging.

Reminder for myself: dyld's mmap looks very normal/simple

@EricBrunel
Author

Hello Louis and thanks for your answer.

I cloned the repository and built valgrind myself, and I ran the built version on the wish interpreter. Here is its output:

==44935== Memcheck, a memory error detector
==44935== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==44935== Using Valgrind-3.24.0.GIT-lbmacos and LibVEX; rerun with -h for copyright info
==44935== Command: /.../TclTk9.0.1/bin/Darwin64/bin/wish9.0
==44935== 
--44935-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option
--44935-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 2 times)
--44935-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 4 times)
--44935-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 8 times)
==44935== Thread 2:
==44935== Invalid read of size 4
==44935==    at 0x105434E3A: ??? (in /usr/lib/system/libsystem_pthread.dylib)
==44935==    by 0x105434BE8: ??? (in /usr/lib/system/libsystem_pthread.dylib)
==44935==  Address 0x18 is not stack'd, malloc'd or (recently) free'd
==44935== 
==44935== 
==44935== Process terminating with default action of signal 11 (SIGSEGV)
==44935==  Access not within mapped region at address 0x18
==44935==    at 0x105434E3A: ??? (in /usr/lib/system/libsystem_pthread.dylib)
==44935==    by 0x105434BE8: ??? (in /usr/lib/system/libsystem_pthread.dylib)
==44935==  If you believe this happened as a result of a stack
==44935==  overflow in your program's main thread (unlikely but
==44935==  possible), you can try to increase the size of the
==44935==  main thread stack using the --main-stacksize= flag.
==44935==  The main thread stack size used in this run was 8388608.

valgrind: ../../../src/coregrind/m_scheduler/scheduler.c:1028 (void run_thread_for_a_while(HWord *, Int *, ThreadId, HWord, Bool)): Assertion 'VG_(in_generated_code) == False' failed.

host stacktrace:
==44935==    at 0x258059A59: ??? (in /.../Valgrind/bin/Darwin64/libexec/valgrind/memcheck-amd64-darwin)
==44935==    by 0x258059DEF: ??? (in /.../Valgrind/bin/Darwin64/libexec/valgrind/memcheck-amd64-darwin)
==44935==    by 0x258059DCF: ??? (in /.../Valgrind/bin/Darwin64/libexec/valgrind/memcheck-amd64-darwin)
==44935==    by 0x2580F3298: ??? (in /.../Valgrind/bin/Darwin64/libexec/valgrind/memcheck-amd64-darwin)
==44935==    by 0x2580F0E4E: ??? (in /.../Valgrind/bin/Darwin64/libexec/valgrind/memcheck-amd64-darwin)
==44935==    by 0x258104805: ??? (in /.../Valgrind/bin/Darwin64/libexec/valgrind/memcheck-amd64-darwin)
==44935==    by 0x258104ADA: ??? (in /.../Valgrind/bin/Darwin64/libexec/valgrind/memcheck-amd64-darwin)

sched status:
  running_tid=3

Thread 1: status = VgTs_WaitSys syscall unix:398 (lwpid 771)
==44935==    at 0x1053FE8AA: ??? (in /usr/lib/system/libsystem_kernel.dylib)
==44935==    by 0x104003836: __getcwd (in /usr/lib/system/libsystem_c.dylib)
==44935==    by 0x1040033C1: __private_getcwd (in /usr/lib/system/libsystem_c.dylib)
==44935==    by 0x1003E28CF: TclpGetNativeCwd (in /.../TclTk9.0.1/bin/Darwin64/lib/libtcl9.0.dylib)
==44935==    by 0x10037EFFA: Tcl_FSGetCwd (in /.../TclTk9.0.1/bin/Darwin64/lib/libtcl9.0.dylib)
==44935==    by 0x10037F448: TclFSCwdIsNative (in /.../TclTk9.0.1/bin/Darwin64/lib/libtcl9.0.dylib)
==44935==    by 0x1003DA2E1: ZipFSPathInFilesystemProc (in /.../TclTk9.0.1/bin/Darwin64/lib/libtcl9.0.dylib)
==44935==    by 0x10037FCEC: Tcl_FSGetFileSystemForPath (in /.../TclTk9.0.1/bin/Darwin64/lib/libtcl9.0.dylib)
==44935==    by 0x10037EB61: Tcl_FSOpenFileChannel (in /.../TclTk9.0.1/bin/Darwin64/lib/libtcl9.0.dylib)
==44935==    by 0x10037EB0C: Tcl_OpenFileChannel (in /.../TclTk9.0.1/bin/Darwin64/lib/libtcl9.0.dylib)
==44935==    by 0x1003D4214: ZipFSOpenArchive (in /.../TclTk9.0.1/bin/Darwin64/lib/libtcl9.0.dylib)
==44935==    by 0x1003D3DE2: TclZipfs_Mount (in /.../TclTk9.0.1/bin/Darwin64/lib/libtcl9.0.dylib)
==44935==    by 0x1003D7BB9: TclZipfs_AppHook (in /.../TclTk9.0.1/bin/Darwin64/lib/libtcl9.0.dylib)
==44935==    by 0x100003E2E: main (in /.../TclTk9.0.1/bin/Darwin64/bin/wish9.0)
client stack range: [0x1040A5000 0x1048A4FFF] client SP: 0x1048A3A48
valgrind stack range: [0x7000009AE000 0x700000AADFFF] top usage: 10448 of 1048576

Thread 2: status = VgTs_Yielding (lwpid 5891)
==44935==    at 0x105434E3A: ??? (in /usr/lib/system/libsystem_pthread.dylib)
==44935==    by 0x105434BE8: ??? (in /usr/lib/system/libsystem_pthread.dylib)
client stack range: ??????? client SP: 0x700009146EF0
valgrind stack range: [0x700004B16000 0x700004C15FFF] top usage: 3632 of 1048576

Thread 3: status = VgTs_Runnable (lwpid 6147)
==44935==    at 0x105434BDC: ??? (in /usr/lib/system/libsystem_pthread.dylib)
client stack range: ??????? client SP: 0x7000091C9B00
valgrind stack range: [0x700004C1A000 0x700004D19FFF] top usage: 4144 of 1048576


Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org

In the bug report, send all the above text, the valgrind
version, and what OS and version you are using.  Thanks.

The part after "Thread 1" actually changes across runs. Sometimes it's one thing, sometimes another. Another thing I had to notice is that if I redirect the output to a file, valgrind has a tendency to start looping, printing out:

==65384== Signal 11 being dropped from thread 0's queue

over and over and over again, and I have to kill it with kill -KILL, nothing else works. So that doesn't make it easy to grab the log...

I did run valgrind within lldb, but I'm not sure the results will be of any help. Here is what I got:

(lldb) target create "./memcheck/memcheck-amd64-darwin"
Current executable set to './memcheck/memcheck-amd64-darwin' (x86_64).
(lldb) run /.../TclTk9.0.1/bin/Darwin64/bin/wish9.0
Process 65248 launched: './memcheck/memcheck-amd64-darwin' (x86_64)
==65248== Memcheck, a memory error detector
==65248== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==65248== Using Valgrind-3.24.0.GIT-lbmacos and LibVEX; rerun with -h for copyright info
==65248== Command: /.../TclTk9.0.1/bin/Darwin64/bin/wish9.0
==65248== 
--65248-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option
--65248-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 2 times)
==65248== Thread 2:
==65248== Invalid read of size 4
==65248==    at 0x105434E3A: ??? (in /usr/lib/system/libsystem_pthread.dylib)
==65248==    by 0x105434BE8: ??? (in /usr/lib/system/libsystem_pthread.dylib)
==65248==  Address 0x18 is not stack'd, malloc'd or (recently) free'd
==65248== 
Process 65248 stopped
* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x18)
    frame #0: 0x000070000106ba1c
->  0x70000106ba1c: movl   (%rbx), %r10d
    0x70000106ba1f: movl   %r10d, %r14d
    0x70000106ba22: movq   %r15, 0x3b8(%rbp)
    0x70000106ba29: movq   %r14, 0x18(%rbp)
Target 0: (memcheck-amd64-darwin) stopped.
(lldb) bt
* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x18)
  * frame #0: 0x000070000106ba1c

Considering the issue with valgrind looping when I redirect its output, I didn't dare to run it with the multiple -d & -v options yet. If you think it is needed, I can do it, I'll just have to hope it won't fill my disk with a huge log...

@LouisBrunner
Owner

That's very odd: you get two different issues depending on how you build Valgrind? In your first message, it seemed like an error during mmap, and now it looks like the TLS pointer is not being set. Is that consistent if you rerun each one multiple times? (It seems like it from the LLDB output, but just to make sure.)

==44935== Process terminating with default action of signal 11 (SIGSEGV)
==44935==  Access not within mapped region at address 0x18
==44935==    at 0x105434E3A: ??? (in /usr/lib/system/libsystem_pthread.dylib)
==44935==    by 0x105434BE8: ??? (in /usr/lib/system/libsystem_pthread.dylib)

The part after "Thread 1" actually changes across runs. Sometimes it's one thing, sometimes another. Another thing I had to notice is that if I redirect the output to a file, valgrind has a tendency to start looping, printing out:

==65384== Signal 11 being dropped from thread 0's queue

over and over and over again, and I have to kill it with kill -KILL, nothing else works. So that doesn't make it easy to grab the log...

If you run with just --trace-signals and copy over the whole log, I might be able to understand what's going wrong there, but I will need at least that flag.

Considering the issue with valgrind looping when I redirect its output, I didn't dare to run it with the multiple -d & -v options yet. If you think it is needed, I can do it, I'll just have to hope it won't fill my disk with a huge log...

It shouldn't generate anything close to 1GB of logs 😄 so it shouldn't be a problem but we can look into this signal issue first if you prefer.

@paulfloyd
Contributor

paulfloyd commented Mar 7, 2025

This could well be related to the changes that I made for ELF systems to handle multiple RW sections. I had to get the mach-o code to do something similar, but it was a quick and dirty job.

I just opened a bugzilla item for this https://bugs.kde.org/show_bug.cgi?id=501194

I'll have a go at fixing it this weekend.

@paulfloyd
Contributor

The macho loading should be fixed. I can run wish and get to the %wish prompt

That's probably not much help as it looks like the process forks. I get many errors with --trace-children=yes.

@EricBrunel
Author

I do indeed get different errors depending on whether I run the version of valgrind installed with Homebrew or the version I built myself. As far as I could see, the error with the Homebrew version is always the one with mmap. For the version I built myself, I'm less sure it's consistent, since as I said, the behavior seems to be a bit random: sometimes I'm getting a short log, sometimes this "Signal 11 being dropped from thread 0's queue" message repeated over and over. Each time I did pay attention, there was the error message "valgrind: ../../../src/coregrind/m_scheduler/scheduler.c:1028 (void run_thread_for_a_while(HWord *, Int *, ThreadId, HWord, Bool)): Assertion 'VG_(in_generated_code) == False' failed.".

Anyway, I did run the version I built myself with --trace-signals=yes and here is the output:

==75516== Memcheck, a memory error detector
==75516== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==75516== Using Valgrind-3.24.0.GIT-lbmacos and LibVEX; rerun with -h for copyright info
==75516== Command: /.../TclTk9.0.1/bin/Darwin64/bin/wish9.0
==75516== 
--75516-- Max kernel-supported signal is 31, VG_SIGVGKILL is 31
--75516-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option
--75516-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 2 times)
--75516-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 4 times)
--75516-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 8 times)
--75516-- sys_sigaction: sigNo 13, new 0x1048a4348, old 0x1048a4370, new flags 0x2
==75516== Thread 2:
==75516== Invalid read of size 4
==75516==    at 0x105434E3A: ??? (in /usr/lib/system/libsystem_pthread.dylib)
==75516==    by 0x105434BE8: ??? (in /usr/lib/system/libsystem_pthread.dylib)
==75516==  Address 0x18 is not stack'd, malloc'd or (recently) free'd
==75516== 
--75516-- sync signal handler: signal=11, si_code=1, EIP=0x105434e3a, eip=0x700001141de4, from kernel
--75516-- SIGSEGV: si_code=1 faultaddr=0x18 tid=2 ESP=0x700008fedef0 seg=0x0-0xffffffff
--75516-- delivering signal 11 (SIGSEGV):1 to thread 2
--75516-- delivering 11 (code 1) to default handler; action: terminate+core
==75516== 
==75516== Process terminating with default action of signal 11 (SIGSEGV)
==75516==  Access not within mapped region at address 0x18
==75516==    at 0x105434E3A: ??? (in /usr/lib/system/libsystem_pthread.dylib)
==75516==    by 0x105434BE8: ??? (in /usr/lib/system/libsystem_pthread.dylib)
==75516==  If you believe this happened as a result of a stack
==75516==  overflow in your program's main thread (unlikely but
==75516==  possible), you can try to increase the size of the
==75516==  main thread stack using the --main-stacksize= flag.
==75516==  The main thread stack size used in this run was 8388608.
--75516-- get_thread_out_of_syscall zaps tid 1 lwp 771

valgrind: ../../../src/coregrind/m_scheduler/scheduler.c:1028 (void run_thread_for_a_while(HWord *, Int *, ThreadId, HWord, Bool)): Assertion 'VG_(in_generated_code) == False' failed.

host stacktrace:
==75516==    at 0x258059A59: ??? (in /.../Valgrind/bin/Darwin64/libexec/valgrind/memcheck-amd64-darwin)
==75516==    by 0x258059DEF: ??? (in /.../Valgrind/bin/Darwin64/libexec/valgrind/memcheck-amd64-darwin)
==75516==    by 0x258059DCF: ??? (in /.../Valgrind/bin/Darwin64/libexec/valgrind/memcheck-amd64-darwin)
==75516==    by 0x2580F3298: ??? (in /.../Valgrind/bin/Darwin64/libexec/valgrind/memcheck-amd64-darwin)
==75516==    by 0x2580F0E4E: ??? (in /.../Valgrind/bin/Darwin64/libexec/valgrind/memcheck-amd64-darwin)
==75516==    by 0x258104805: ??? (in /.../Valgrind/bin/Darwin64/libexec/valgrind/memcheck-amd64-darwin)
==75516==    by 0x258104ADA: ??? (in /.../Valgrind/bin/Darwin64/libexec/valgrind/memcheck-amd64-darwin)

sched status:
  running_tid=3

Thread 1: status = VgTs_WaitSys syscall unix:398 (lwpid 771)
==75516==    at 0x1053FE8AA: ??? (in /usr/lib/system/libsystem_kernel.dylib)
==75516==    by 0x104003836: __getcwd (in /usr/lib/system/libsystem_c.dylib)
==75516==    by 0x1040033C1: __private_getcwd (in /usr/lib/system/libsystem_c.dylib)
==75516==    by 0x1003E28CF: TclpGetNativeCwd (in /.../TclTk9.0.1/bin/Darwin64/lib/libtcl9.0.dylib)
==75516==    by 0x10037EFFA: Tcl_FSGetCwd (in /.../TclTk9.0.1/bin/Darwin64/lib/libtcl9.0.dylib)
==75516==    by 0x10037F448: TclFSCwdIsNative (in /.../TclTk9.0.1/bin/Darwin64/lib/libtcl9.0.dylib)
==75516==    by 0x1003DA2E1: ZipFSPathInFilesystemProc (in /.../TclTk9.0.1/bin/Darwin64/lib/libtcl9.0.dylib)
==75516==    by 0x10037FCEC: Tcl_FSGetFileSystemForPath (in /.../TclTk9.0.1/bin/Darwin64/lib/libtcl9.0.dylib)
==75516==    by 0x10037EB61: Tcl_FSOpenFileChannel (in /.../TclTk9.0.1/bin/Darwin64/lib/libtcl9.0.dylib)
==75516==    by 0x10037EB0C: Tcl_OpenFileChannel (in /.../TclTk9.0.1/bin/Darwin64/lib/libtcl9.0.dylib)
==75516==    by 0x1003D4214: ZipFSOpenArchive (in /.../TclTk9.0.1/bin/Darwin64/lib/libtcl9.0.dylib)
==75516==    by 0x1003D3DE2: TclZipfs_Mount (in /.../TclTk9.0.1/bin/Darwin64/lib/libtcl9.0.dylib)
==75516==    by 0x1003D7BB9: TclZipfs_AppHook (in /.../TclTk9.0.1/bin/Darwin64/lib/libtcl9.0.dylib)
==75516==    by 0x100003E2E: main (in /.../TclTk9.0.1/bin/Darwin64/bin/wish9.0)
client stack range: [0x1040A5000 0x1048A4FFF] client SP: 0x1048A3A48
valgrind stack range: [0x7000009AE000 0x700000AADFFF] top usage: 10448 of 1048576

Thread 2: status = VgTs_Yielding (lwpid 9987)
==75516==    at 0x105434E3A: ??? (in /usr/lib/system/libsystem_pthread.dylib)
==75516==    by 0x105434BE8: ??? (in /usr/lib/system/libsystem_pthread.dylib)
client stack range: ??????? client SP: 0x700008FEDEF0
valgrind stack range: [0x700004B1E000 0x700004C1DFFF] top usage: 3648 of 1048576

Thread 3: status = VgTs_Runnable (lwpid 9731)
==75516==    at 0x105434BDC: ??? (in /usr/lib/system/libsystem_pthread.dylib)
client stack range: ??????? client SP: 0x700009070B00
valgrind stack range: [0x700004C22000 0x700004D21FFF] top usage: 4144 of 1048576


Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org

In the bug report, send all the above text, the valgrind
version, and what OS and version you are using.  Thanks.

Not sure if that's any help...

@LouisBrunner
Owner

The macho loading should be fixed. I can run wish and get to the %wish prompt

That's probably not much help as it looks like the process forks. I get many errors with --trace-children=yes.

Thanks @paulfloyd! I am still struggling to get the upstream merge into main, but once that's done I will pull your changes in.

Each time I did pay attention, there was the error message "valgrind: ../../../src/coregrind/m_scheduler/scheduler.c:1028 (void run_thread_for_a_while(HWord *, Int *, ThreadId, HWord, Bool)): Assertion 'VG_(in_generated_code) == False' failed.".

At least that's consistent, good to know.

Anyway, I did run the version I built myself with --trace-signals=yes and here is the output:

Ah, could you try it a few times until you get the looping "Signal 11 being dropped from thread 0's queue" message, as that's what I want to investigate? If you could add --trace-syscalls=yes as well, it might make it easier to guess where Valgrind is not restoring signal masks correctly.

@paulfloyd
Contributor

If the xmllint failures during the build are the problem, that has been fixed by Mark Wielaard:

https://sourceware.org/git/gitweb.cgi?p=valgrind.git;h=9f956db3e5eb0afb0d60987f3658b66646a0ac81

commit 9f956db
Author: Mark Wielaard
Date: Sun Mar 9 16:46:50 2025 +0100

docs/Makefile.am: Make sure xml catalog file exists for xmllint check

When XML_CATALOG_FILES don't exist on the system xmllint will have to
query those files through various websites. When there is a network
error xmllint will fail. So make sure to only run the validity tests
when both xmllint and XML_CATALOG_FILES exists.

@LouisBrunner
Owner

LouisBrunner commented Mar 11, 2025

If the xmllint failures during the build are the problem, that has been fixed by Mark Wielaard:

Not only that: there were also a bunch of merge artifacts which git didn't handle correctly (I guess things break down after 200+ conflicts).

554dcd0

Anyway, it's now in main and your latest changes are waiting for the CI to pass: #120

@LouisBrunner
Owner

@EricBrunel Could you give it another try after pulling the latest changes from the main branch?

@EricBrunel
Author

Sorry for the delay. I tested the latest changes on the main branch by cloning the repo, and I've attached the logs I'm getting. The first one (valgrind20250312a.log) is from running valgrind with no options. The other two (b & c) are with the options --trace-signals=yes --trace-syscalls=yes. I include both because the error is not the same. I ran valgrind a few times but couldn't reproduce the issue with the looping "Signal 11 being dropped from thread 0's queue" message.

valgrind20250312a.log
valgrind20250312b.log
valgrind20250312c.log

@paulfloyd
Contributor

paulfloyd commented Mar 12, 2025

It's not serious but I need to look at why it's saying "can't open file to inspect ELF header" and not macho.

I'll give this a go with the upstream code on a macOS 10.13 VM.

Things seem to be going wrong from a call to pipe().

@LouisBrunner
Owner

The difference between the errors is most likely due to a race condition, but it does seem to indicate that there are multiple issues at play:

  • Tcl_Panic from the main thread, most likely due to the pipe call failing
  • TLS-related issue in the newly created thread

I am unaware of any issue with pipe, so I would need to look into it in more detail. However, I am fairly certain I never fixed wqthread on 10.13; I only fixed POSIX threads.

I will look into those at a later date.

It's not serious but I need to look at why it's saying "can't open file to inspect ELF header" and not macho.

Because that's the hardcoded message in debuginfo.c's di_notify_mmap. Given the path of the libraries, it might be because macOS is expected to be using DYLD_SHARED_CACHED for this? If so we would need to backport all those fixes.

@paulfloyd
Contributor

10.13 doesn't use DSC (or at least it was still possible to request to use files rather than DSC). I think that 10.15 was when the option to use files was removed, making DSC mandatory.

I'll fix the message soon.

@LouisBrunner
Owner

10.13 doesn't use DSC (or at least it was still possible to request to use files rather than DSC). I think that 10.15 was when the option to use files was removed, making DSC mandatory.

I am well aware but this is only because Valgrind itself tells dyld not to use it, not because it's not available. DSC has been around since iOS 3.1 (2010) and it's clearly supported in the dyld version of 10.13. The question is why would those files not be available on 10.13 even though they were only removed in 10.15/11 as you said yourself.

@paulfloyd
Contributor

I'll take a look this evening.

@paulfloyd
Contributor

With the upstream code in a 10.13 VM I can run "vg-in-place wish" (which runs /usr/bin/wish). I get to the wish prompt and an empty wish graphical window.

"vg-in-place --trace-children=yes wish" gives me an execve failure for /usr/bin/dirname.

With main from this repo af822ef, again I get to the wish prompt without tracing children. I get a lot further - to the UNKNOWN mach_msgs when tracing children.

The files generating the ELF error message do exist. It's quite possible that there is an issue reading universal binaries. For instance

file /System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/Print.framework/Versions/A/Print
/System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/Print.framework/Versions/A/Print: Mach-O universal binary with 2 architectures: [x86_64:Mach-O 64-bit dynamically linked shared library x86_64] [i386:Mach-O dynamically linked shared library i386]
/System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/Print.framework/Versions/A/Print (for architecture x86_64): Mach-O 64-bit dynamically linked shared library x86_64
/System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/Print.framework/Versions/A/Print (for architecture i386): Mach-O dynamically linked shared library i386
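
For reference, a universal ("fat") file like the one above starts with a small big-endian header that lists one entry per architecture, and a reader has to pick the matching slice before it can parse the Mach-O header inside. The following is a minimal sketch in C of that lookup. It is not Valgrind's reader: the constant values mirror those in <mach-o/fat.h> and <mach/machine.h>, but the SKETCH_ names and find_fat_slice are made up for this illustration, and only the 32-bit fat_arch layout is handled (not FAT_MAGIC_64).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values as defined by Apple's headers; names are local to this sketch. */
#define SKETCH_FAT_MAGIC        0xcafebabeu  /* stored big-endian on disk */
#define SKETCH_CPU_TYPE_I386    0x00000007u
#define SKETCH_CPU_TYPE_X86_64  0x01000007u  /* x86 | 64-bit ABI flag */

#define SKETCH_FAT_HEADER_SIZE  8u   /* magic + nfat_arch */
#define SKETCH_FAT_ARCH_SIZE    20u  /* cputype, cpusubtype, offset, size, align */

/* Reads a 32-bit big-endian value regardless of host endianness. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Looks for the slice matching cputype in a fat header held in buf.
   On success returns 1 and stores the slice's file offset and size.
   Returns 0 if buf is not a fat file, is truncated, or has no such slice. */
int find_fat_slice(const unsigned char *buf, size_t len, uint32_t cputype,
                   uint32_t *offset, uint32_t *size)
{
    assert(buf != NULL && offset != NULL && size != NULL);

    if (len < SKETCH_FAT_HEADER_SIZE || read_be32(buf) != SKETCH_FAT_MAGIC)
        return 0;

    uint32_t narch = read_be32(buf + 4);
    for (uint32_t i = 0; i < narch; i++) {
        size_t entry = SKETCH_FAT_HEADER_SIZE + (size_t)i * SKETCH_FAT_ARCH_SIZE;
        if (entry + SKETCH_FAT_ARCH_SIZE > len)
            return 0;  /* header claims more entries than we were given */
        if (read_be32(buf + entry) == cputype) {
            *offset = read_be32(buf + entry + 8);
            *size   = read_be32(buf + entry + 12);
            return 1;
        }
    }
    return 0;
}
```

For the Print framework shown above, a lookup for SKETCH_CPU_TYPE_X86_64 would be expected to succeed; a thin (non-universal) Mach-O file would make find_fat_slice return 0 because the magic differs.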

I need to do some more debugging and poking around with otool.

@LouisBrunner
Owner

Thanks for looking into this @paulfloyd.

Given that pipe() is failing with 0x18 (i.e. errno EMFILE, Too many descriptors are active), could these issues be related? Are we aware of any part of Valgrind that might be leaking fds?
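
As a sanity check on that reading of 0x18, here is a small C sketch that reproduces the failure mode outside of Valgrind: it lowers the soft descriptor limit, opens pipes until pipe() fails, and reports the errno. The function name pipe_errno_at_fd_limit and the MAX_PIPES cap are hypothetical and only for illustration; EMFILE has the value 24 (0x18) on both macOS and Linux.

```c
#include <assert.h>
#include <errno.h>
#include <sys/resource.h>
#include <unistd.h>

#define MAX_PIPES 64  /* upper bound on how many pipes the sketch will open */

/* Temporarily sets the soft RLIMIT_NOFILE to soft_limit, then opens pipes
   until pipe() fails. Returns the errno of the first failure, 0 if all
   MAX_PIPES pipes could be opened, or -1 if the limit could not be changed.
   Every pipe opened here is closed and the original limit is restored. */
int pipe_errno_at_fd_limit(rlim_t soft_limit)
{
    assert(soft_limit > 0);

    struct rlimit saved;
    if (getrlimit(RLIMIT_NOFILE, &saved) != 0)
        return -1;

    struct rlimit lowered = saved;
    lowered.rlim_cur = soft_limit;
    if (setrlimit(RLIMIT_NOFILE, &lowered) != 0)
        return -1;

    int fds[MAX_PIPES][2];
    int opened = 0;
    int failure = 0;
    while (opened < MAX_PIPES) {
        if (pipe(fds[opened]) != 0) {
            failure = errno;  /* expected: EMFILE once descriptors run out */
            break;
        }
        opened++;
    }

    /* Clean up so the caller's descriptor table and limit are unchanged. */
    for (int i = 0; i < opened; i++) {
        close(fds[i][0]);
        close(fds[i][1]);
    }
    setrlimit(RLIMIT_NOFILE, &saved);
    return failure;
}
```

With a soft limit of 16, the call is expected to return EMFILE after a handful of pipes, which is the same errno a process would see if descriptors were being leaked elsewhere.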

@EricBrunel Could you run valgrind again with all the previous flags and add --track-fds=all as well?

@EricBrunel
Author

Output of valgrind with options --trace-signals=yes --trace-syscalls=yes --track-fds=all attached. Looks like there isn't any file descriptor leak: the only ones opened in the end are for the tty and the log file...

valgrind20250312e.log

@paulfloyd
Contributor

Thanks for looking into this @paulfloyd.

Given that pipe() is failing with 0x18 (i.e. errno EMFILE, Too many descriptors are active), could these issues be related? Are we aware of any part of Valgrind that might be leaking fds?

@EricBrunel Could you run valgrind again with all the previous flags and add --track-fds=all as well?

There is an error in the change I made that switched from reading a 4k block to passing the fd to ML_(check_macho_and_get_rw_loads), which now allocates enough for the segment commands. The close is only in the non-Darwin code block, so the fd is never closed on Darwin. I'll fix that shortly.
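
The general shape of that kind of fix can be sketched in C as follows: open the file, inspect the header, and close the descriptor on a single exit path that every platform and every outcome goes through. This is an illustration under assumptions, not the actual Valgrind change; read_magic is a hypothetical helper and uses plain POSIX calls rather than Valgrind's internal wrappers.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Reads the first four bytes of path as a big-endian magic number.
   Returns 0 and stores the value on success, -1 on any failure.
   The descriptor is closed on every path once open() has succeeded. */
int read_magic(const char *path, uint32_t *magic)
{
    assert(path != NULL && magic != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;  /* nothing was opened, so nothing to close */

    unsigned char bytes[4];
    ssize_t got = read(fd, bytes, sizeof bytes);
    int ok = (got == (ssize_t)sizeof bytes);
    if (ok) {
        *magic = ((uint32_t)bytes[0] << 24) | ((uint32_t)bytes[1] << 16) |
                 ((uint32_t)bytes[2] << 8)  |  (uint32_t)bytes[3];
    }

    /* Single exit path: close whether or not the read worked, and keep
       this outside any platform-specific #if block. */
    close(fd);
    return ok ? 0 : -1;
}
```

Calling read_magic repeatedly should leave the lowest free descriptor number unchanged, which is a cheap way to check that nothing is leaking.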

@paulfloyd
Contributor

Pushed a fix. I also need the following small change

static Bool check_fat_macho_and_get_rw_loads(const void* macho_header, Int* rw_loads)
@@ -849,13 +849,10 @@ Bool ML_(read_macho_debug_info)( struct DebugInfo* di )
       from_memory = True;
       kernel_slide = VG_(dyld_cache_get_slide)();
    }
+#endif
    if (di->fsm.rw_map_count) {
       have_rw = True;
    }
-#else
-   vg_assert(di->fsm.rw_map_count);
-   have_rw = True;
-#endif

(rw_map_count can now be zero)

Then I get

==8227== Nulgrind, the minimal Valgrind tool
==8227== Copyright (C) 2002-2024, and GNU GPL'd, by Nicholas Nethercote et al.
==8227== Using Valgrind-3.25.0.GIT-lbmacos and LibVEX; rerun with -h for copyright info
==8227== Command: /usr/bin/../../System/Library/Frameworks/Tk.framework/Versions/8.5/Resources/Wish.app/Contents/MacOS/Wish
==8227== Parent PID: 8223
==8227==
--8227-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option
--8227-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 2 times)
--8227-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 4 times)
--8227-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 8 times)
--8227-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 16 times)

valgrind: m_syswrap/syswrap-amd64-darwin.c:517 (void wqthread_hijack(Addr, Addr, Addr, Addr, UInt, Int, Addr)): Assertion 'tst->os_state.pthread - magic_delta == self' failed.

@LouisBrunner
Owner

Thanks a lot @paulfloyd for looking into that.

@EricBrunel with all the changes merged on main, your program should run fine now (it works for me on amd64 15.0).

Feel free to report if you are still seeing issues.
