
Debugging RT components


update: please use the NOSIGHDLR feature introduced in commit 59317defb7

When debugging RT as described below, export NOSIGHDLR=1 at the shell level before startup; this greatly eases debugging the daemons (haltalk, rtapi_app, webtalk etc) with gdb.
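
For example, a minimal sketch assuming a bash shell (the variable only needs to be set in the shell which starts realtime):

export NOSIGHDLR=1
realtime start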

Issues in RT components are harder to debug than those in normal user programs, but with the userland thread styles it is in fact much easier than with RTAI. This note explains how to debug fatal errors in RT space, either "live" or "post-mortem", provided a core dump is available. I will focus exclusively on userland thread styles (posix, xenomai and rt-preempt).

The RT daemons

There are two key processes needed to run RT applications: "msgd" and "rtapi_app".

  • msgd is the process which starts first and prepares the RT environment: it sets up the global shared memory segment, initializes the log ring buffer, and then retrieves any messages coming out of RT components. msgd writes these messages via syslog, typically to /var/log/linuxcnc.log. So if something unusual happens, this log is the first place to look for issues in RT land.

  • rtapi_app is the process which actually loads RT components; they live as shared libraries in its process space. This process also starts and runs RT threads, which is why it needs higher privileges than msgd: it is setuid root. rtapi_app starts after msgd.

Neither runs without the other; the reason there are two is the different privileges. msgd is the same binary for all thread flavors; rtapi_app is a different binary for each flavor.

Both programs are started, directly or indirectly, by the 'realtime' script; the 'linuxcnc' script does this behind the scenes. They are background processes which don't write to the terminal session - they write to the log file.
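
For example, to start realtime with verbose logging and watch what msgd writes (a sketch; the log path may differ depending on your syslog configuration):

DEBUG=5 realtime start
tail -f /var/log/linuxcnc.log
# ... exercise the system ...
realtime stop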

Naming

The RT programs live in the 'libexec' directory as 'libexec/rtapi_msgd' and 'libexec/rtapi_app_<flavor>' respectively. This becomes important when we debug a core file post-mortem. For xenomai, for example, the rtapi_app pathname is libexec/rtapi_app_xenomai.

As soon as they are started, they change their name in "ps" to "msgd:0" and "rtapi:0". This might sound confusing, but it is needed because RTAPI nowadays supports multiple instances in parallel, and these need to be distinguished somehow - the ":0" is the "rtapi instance id". For now, let's assume this will always be 0.
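
The renamed processes can be found with pidof; for instance ("pidof rtapi:0" is used again further below, "pidof msgd:0" works the same way):

pidof msgd:0
pidof rtapi:0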

Identifying rtapi and msgd in the log

The log can be a bit verbose, but the key lines are:

Aug 23 22:33:03 nwheezy msgd:0: startup pid=12081 flavor=xenomai rtlevel=5 usrlevel=5 halsize=512000 shm=Posix gcc=4.7.2 version=v2.7.0~pre0~machinetalk-preview-4~eeb477e
...
Aug 23 22:33:03 nwheezy msgd:0: rtapi_app:12087:user rtapi_app:0 ready flavor=xenomai gcc=4.7.2 git=v2.7.0~pre0~machinetalk-preview-4~eeb477e

So msgd is pid 12081, and rtapi is pid 12087.

Diagnosing an error in RT land

When something unusual happens, it's usually logged. For instance, a fatal error in an RT component might produce this:

Aug 23 23:10:59 nwheezy msgd:0: rtapi_app:7229:user signal 11 - 'Segmentation fault' received, dumping core (current dir=/home/mah/machinekit-tutorial/src/tests)
Aug 23 23:10:59 nwheezy msgd:0: rtapi_app exit detected - scheduled shutdown
Aug 23 23:11:01 nwheezy msgd:0: msgd shutting down

As a consequence of rtapi_app dying, msgd also shuts down; there is nothing more it can do.

Note that rtapi_app (which is rtapi:0 in ps) logged its current directory; that's where it writes a core dump by default.

Debugging strategy

Ideally you narrow the problem down to, for example, loading a single component. It is best to do this without the "linuxcnc" script, using a minimal halcmd script which leads up to the error.

In that case we can trap the error as it happens. If that is not possible, inspecting the core dump is needed.

A faulty test component

To demonstrate debugging, we'll use a component whose only purpose is to create a fault, either during loading or in its RT thread function, as desired:

component die;
pin out u32 dummy   "to keep comp happy";

function _;
option extra_setup;
option singleton yes;
license "GPLv2 or later";

;;

static int inthread = 0;
RTAPI_MP_INT(inthread, "if 0, segfault in rtapi_app_main(); if nonzero, segfault in thread function");

static char *foo;  // deliberate NULL pointer

FUNCTION(_) {
    if (inthread)
	*foo = 't';
}

EXTRA_SETUP() {
    if (!inthread)
	*foo = 'm';
    return 0;
}

Build and install this with 'comp --install die.comp'.

We can now trip RT by doing 'halcmd loadrt die', which by default causes a null pointer dereference during setup (in EXTRA_SETUP).
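
A minimal session for this case might look as follows (a sketch - the log message is of the 'signal 11' kind shown above):

DEBUG=5 realtime start
halcmd loadrt die               # segfaults rtapi_app in EXTRA_SETUP()
tail /var/log/linuxcnc.log      # look for the 'signal 11' line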

If loaded with 'halcmd loadrt die inthread=1', it will do the same, but in the RT thread function while it is executing:

# in src/tests/segfault-rt-thread.hal
loadrt die inthread=1
loadrt threads
addf die thread1
# starting the thread will cause a segfault of rtapi_app:
start

Trapping an error live

Assume you’ve narrowed down the issue to a few halcmd lines. In this case, the steps are:

  • start realtime with highest logging: "DEBUG=5 realtime start"

  • identify the rtapi_app process, either with "pidof rtapi:0" or by looking at the log as above

  • attach gdb to the running rtapi_app; this must be done as root: 'sudo gdb -p <rtapi_app PID>'

  • gdb will attach to the running process, and stop it. Tell gdb to continue by typing 'continue'.

  • in a separate window, run the halcmd script which will trip the error (in the example below it was 'halcmd loadrt die')

  • if rtapi_app catches a fault, gdb should respond by showing a command prompt and an indication of where the error occurred:

root@nwheezy:/home/mah/machinekit-tutorial/src# pidof rtapi:0
7492
root@nwheezy:/home/mah/machinekit-tutorial/src# gdb -p 7492
GNU gdb (GDB) 7.4.1-debian
... lots of gibberish about loading and symbols...
Loaded symbols for /lib/i386-linux-gnu/i686/cmov/libnss_files.so.2
0xb76f9424 in __kernel_vsyscall ()
(gdb) directory tests/                       # optional - tell gdb where to look for source files
Source directories searched: /home/mah/machinekit-tutorial/src/tests:$cdir:$cwd
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0xb756cabd in extra_setup (__comp_inst=0xb6de9118, prefix=0xb756cafc "die", extra_arg=0) at die.comp:23
23		*foo = 'm';
(gdb) p foo   # where was foo pointing to?
$1 = 0x0      # no wonder.
(gdb) 

Found the culprit!
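
From here the usual gdb commands apply if you want to dig further; for instance (standard gdb, nothing Machinekit-specific):

(gdb) list            # show the source around the faulting line
(gdb) info locals     # inspect local variables in the current frame
(gdb) backtrace       # how we got here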

Analyzing a core dump

If rtapi_app dropped a core, it's still possible to do a post-mortem, but it's a bit more involved due to the multithreaded nature of rtapi_app.

An example using the "die" component above, and the script to cause a segfault in RT:

mah@nwheezy:~/machinekit-tutorial/src/tests$ DEBUG=5 realtime restart
mah@nwheezy:~/machinekit-tutorial/src/tests$ halcmd -f segfault-rt-thread.hal
segfault-rt-thread.hal:1: Realtime module 'die' loaded
segfault-rt-thread.hal:2: Realtime module 'threads' loaded
segfault-rt-thread.hal:3: Function 'die' added to thread 'thread1'
segfault-rt-thread.hal:6: Realtime threads started

The log says something like this:

Aug 23 23:52:06 nwheezy msgd:0: rtapi_app:7852:user signal 11 - 'Segmentation fault' received, dumping core (current dir=/home/mah/machinekit-tutorial/src/tests)

In fact there’s a core dump as indicated:

mah@nwheezy:~/machinekit-tutorial/src/tests$ ls -l core
-rw------- 1 root mah 49106944 Aug 23 23:52 core

Since the core file is owned by root, run gdb as root, like so: "gdb <install path>/libexec/rtapi_app_<flavor> <path-to-coredump>"

Example session:

mah@nwheezy:~/machinekit-tutorial/src/tests$ sudo gdb ../../libexec/rtapi_app_xenomai ./core 
GNU gdb (GDB) 7.4.1-debian
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i486-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/mah/machinekit-tutorial/libexec/rtapi_app_xenomai...done.

warning: core file may not match specified executable file.
[New LWP 7852]
[New LWP 7853]
[New LWP 7854]
[New LWP 7861]

warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/i386-linux-gnu/i686/cmov/libthread_db.so.1".
Core was generated by `rtapi:0                                                             '.
Program terminated with signal 11, Segmentation fault.
#0  0xb7748424 in __kernel_vsyscall ()
(gdb) backtrace 
#0  0xb7748424 in __kernel_vsyscall ()
#1  0xb7184426 in *__GI___poll (fds=0xb721aff4, nfds=2, timeout=1000)
    at ../sysdeps/unix/sysv/linux/poll.c:87
#2  0xb73c39f2 in zmq_poll () from /usr/lib/i386-linux-gnu/libzmq.so.3
#3  0xb743de16 in zloop_start () from /usr/lib/i386-linux-gnu/libczmq.so.1
#4  0x0804ed78 in mainloop (argc=2, argv=0xbfc950f4) at rtapi/xenomai/rtapi_app.cc:975
#5  0x0804f6fa in main (argc=2, argv=0xbfc950f4) at rtapi/xenomai/rtapi_app.cc:1314
(gdb) 

That's not very helpful because it doesn't point to the error, but rather to some code in rtapi_app. Note, however, that gdb listed four threads ("LWP" stands for light-weight process - the Linux way of implementing threads), and by default gdb just shows the backtrace of the oldest thread. So let's see what these threads are:

(gdb) info thread
  Id   Target Id         Frame 
  4    Thread 0xb6e37b70 (LWP 7861) 0xb7748424 in __kernel_vsyscall ()
  3    Thread 0xb6612b70 (LWP 7854) 0xb7748424 in __kernel_vsyscall ()
  2    Thread 0xb6e13b70 (LWP 7853) 0xb7748424 in __kernel_vsyscall ()
* 1    Thread 0xb6ff0710 (LWP 7852) 0xb7748424 in __kernel_vsyscall ()
(gdb) 

We were looking at thread #1 (LWP 7852), the main thread. But the RT thread was certainly created after all the others, so that's not the one we're looking for.

What were these threads doing? Hard to tell from the frames alone. So let's look at the call stacks of these threads; usually the top four frames give a clue:

(gdb)  thread apply all where 4

Thread 4 (Thread 0xb6e37b70 (LWP 7861)):
#0  0xb7748424 in __kernel_vsyscall ()
#1  0xb715c106 in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#2  0xb715bf30 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138
#3  0x0804e58f in sigaction_handler (sig=11, si=0xb6e36e8c, uctx=0xb6e36f0c)
    at rtapi/xenomai/rtapi_app.cc:714
(More stack frames follow...)

Thread 3 (Thread 0xb6612b70 (LWP 7854)):
#0  0xb7748424 in __kernel_vsyscall ()
#1  0xb7192886 in epoll_wait () at ../sysdeps/unix/syscall-template.S:82
#2  0xb738fb1b in ?? () from /usr/lib/i386-linux-gnu/libzmq.so.3
#3  0xb73ba23c in ?? () from /usr/lib/i386-linux-gnu/libzmq.so.3
(More stack frames follow...)

Thread 2 (Thread 0xb6e13b70 (LWP 7853)):
#0  0xb7748424 in __kernel_vsyscall ()
#1  0xb7192886 in epoll_wait () at ../sysdeps/unix/syscall-template.S:82
#2  0xb738fb1b in ?? () from /usr/lib/i386-linux-gnu/libzmq.so.3
#3  0xb73ba23c in ?? () from /usr/lib/i386-linux-gnu/libzmq.so.3
(More stack frames follow...)

Thread 1 (Thread 0xb6ff0710 (LWP 7852)):
#0  0xb7748424 in __kernel_vsyscall ()
#1  0xb7184426 in *__GI___poll (fds=0xb721aff4, nfds=2, timeout=1000)
    at ../sysdeps/unix/sysv/linux/poll.c:87
#2  0xb73c39f2 in zmq_poll () from /usr/lib/i386-linux-gnu/libzmq.so.3
#3  0xb743de16 in zloop_start () from /usr/lib/i386-linux-gnu/libczmq.so.1
(More stack frames follow...)
(gdb) 

Aha! Thread #4 was the one which caught the signal (it is executing "sigaction_handler", the signal handler function in rtapi_app), so let's zoom in on #4:

(gdb) thread 4
[Switching to thread 4 (Thread 0xb6e37b70 (LWP 7861))]
#0  0xb7748424 in __kernel_vsyscall ()

and look at its backtrace in more detail:

(gdb) backtrace 
#0  0xb7748424 in __kernel_vsyscall ()
#1  0xb715c106 in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#2  0xb715bf30 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138
#3  0x0804e58f in sigaction_handler (sig=11, si=0xb6e36e8c, uctx=0xb6e36f0c)
    at rtapi/xenomai/rtapi_app.cc:714
#4  <signal handler called>
#5  0xb75bba9a in _ (__comp_inst=0xb6e38118, period=1000000) at die.comp:18
#6  0xb6f51630 in thread_task (arg=0xb6eb4d48) at hal/lib/hal_lib.c:2824
#7  0xb6f635e0 in _rtapi_task_wrapper (task_id_hack=0x1) at rtapi/xenomai.c:261
#8  0xb75b2e59 in ?? () from /usr/lib/libnative.so.3
#9  0xb7593c39 in start_thread (arg=0xb6e37b70) at pthread_create.c:304
#10 0xb7191e1e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:130
(gdb) 

So in frame #5 we have the faulty RT function which triggered the signal - source line and all. Let's look at it:

(gdb) up 5
#5  0xb75bba9a in _ (__comp_inst=0xb6e38118, period=1000000) at die.comp:18
18		*foo = 't';
(gdb) list
13	
14	static char *foo;  // deliberate NULL pointer
15	
16	FUNCTION(_) {
17	    if (inthread)
18		*foo = 't';                 // <<--- here!
19	}
20	
21	EXTRA_SETUP() {
22	    if (!inthread)
(gdb) 

Identified!

Preparing for debugging

As a rule of thumb when chasing down bugs, build the source tree with 'make clean; make OPT=-O0'. This disables optimization and makes it easier to step through the code as programmed, not as the optimizer decided to rearrange it or optimize it out altogether.
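
A sketch of such a rebuild, assuming a run-in-place source tree (the 'make setuid' step restores the setuid-root bit on rtapi_app mentioned above; adjust paths to your setup):

cd src
make clean
make OPT=-O0
sudo make setuid    # rtapi_app must be setuid root again after rebuilding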

Where’s my core dump?

Whether a core file is written - and where - depends on several factors. The gory details are in 'man core', and this document might help: http://techvolve.blogspot.co.at/2014/05/gdb-debugging-core-dump-of-user-space.html
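
To check the current settings on your system (standard Linux facilities, nothing Machinekit-specific):

ulimit -c                             # current core size limit; 0 means no core files are written
cat /proc/sys/kernel/core_pattern     # where, and under which name, cores are written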

I find it helpful to have coredumps written to a directory under /tmp, and have them tagged with command name and process id so they are easier to tell apart. Also, a reboot will clean them up. Note there’s a chance of filling /tmp with many core files, since /tmp is usually a RAM-based filesystem of limited size.

I go about it like so:

ulimit -S -c unlimited    # no limit on core dump size; set this in the shell which starts realtime
# make a directory in /tmp where coredumps are written
# run the following as root
mkdir /tmp/coredump
chmod a+w /tmp/coredump
echo "/tmp/coredump/core.%e.%p" > /proc/sys/kernel/core_pattern   # %e = command name, %p = pid