Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory faults on subscribers #668

Open
Affonso-Gui opened this issue May 20, 2021 · 6 comments · May be fixed by #675
Open

Memory faults on subscribers #668

Affonso-Gui opened this issue May 20, 2021 · 6 comments · May be fixed by #675

Comments

@Affonso-Gui
Copy link
Member

I have been struggling with weird segfaults on my custom build for some time but I believe I have finally found a way to reproduce them in the release build:

  1. rostopic pub /msg std_msgs/String "hello" -r 100
  2. Execute the following
(ros::roseus "test")

(defclass my-sub :slots (msg))

(defmethod my-sub
 (:init () self)
 (:update (_msg) (setq msg _msg))
 (:get () msg))

(defun my-one-shot (topic msgtype)
 (let ((sub (instance my-sub :init)))
   (ros::subscribe topic msgtype #'send sub :update)
   (while (not (send sub :get))
     (ros::spin-once))
   (send sub :get)))

(defun foo ()
 (let ((i 0))
   (while t
     (if (zerop (mod (incf i) 100)) (print i))
     (my-one-shot "/msg" std_msgs::String))))

(foo)

This will execute normally at first but eventually cause a segmentation fault if ran for long enough.
Testing with the 1.7.4-1bionic.20210505.033302 and 1.7.4-1bionic.20200930.061900 builds several times the segmentation fault occurred around 900, 2400 and 8600 iterations, depending on each run.

What I think the problem is

The subscribers registered in this case look something like (#<compiled-code #X5610a667fe40 send> #<my-sub #X5610a6cdb6c8> :update) and are attributed to symbols in the lisp package to avoid being garbage collected.
However, I believe that the object instances within the registered and supposedly guarded list are actually succeptible to being garbage collected, leading to the observed segmentation faults.
I have also verified that the errors always occur just after a garbage collection.

@708yamaguchi Could you see if you can recreate the situation in your machine?

@708yamaguchi
Copy link
Member

Yes, the problem is reproduced on my machine as well.
My environment is as follows:

$ apt show ros-melodic-roseus
Package: ros-melodic-roseus
Version: 1.7.4-1bionic.20210505.033302
...

In my environment, I called the (foo) function 10 times while restarting roseus, all of which printed 900 and caused a segmentation fault.

1.irteusgl$ (foo)
100
200
300
400
500
600
700
800
900
Call Stack (max depth: 20):
  0: at (ros::spin-once)
  1: at (ros::spin-once)
  2: at (while (not (send sub :get)) (ros::spin-once))
  3: at (let ((sub (instance my-sub :init))) (ros::subscribe topic msgtype #'send sub :update) (while (not (send sub :get)) (ros::spin-once)) (send sub :get))
  4: at (my-one-shot "/msg" std_msgs::string)
  5: at (while t (if (zerop (mod (incf i) 100)) (print i)) (my-one-shot "/msg" std_msgs::string))
  6: at (let ((i 0)) (while t (if (zerop (mod (incf i) 100)) (print i)) (my-one-shot "/msg" std_msgs::string)))
  7: at (foo)
  8: at #<compiled-code #X560a24deee18>
;p=pointer?(0x560a24d2c078)
;; Segmentation Fault.
;; in /opt/ros/melodic/share/euslisp/jskeus/eus/Linux64/bin/irteusgl 0 error: cannot find method euserror
;; You are still in a signal handler.
;;Try reset or throw to upper level as soon as possible.
;; code=-1472911824 x=a8352500 addr=0
Fatal: 

@YoheiKakiuchi
Copy link
Member

It is a workaround, I think.
'gensym' may occure GC.
So, we should push args(and other objects) to stack.
Then, pop it.

diff --git a/roseus/roseus.cpp b/roseus/roseus.cpp
index f0b4a19..b27ca20 100644
--- a/roseus/roseus.cpp
+++ b/roseus/roseus.cpp
@@ -369,8 +369,10 @@ public:
       ROS_ERROR("subscription callback function install error");
     }
     // avoid gc
+    vpush(args);
     pointer p=gensym(ctx);
     setval(ctx,intern(ctx,(char*)(p->c.sym.pname->c.str.chars),strlen((char*)(p->c.sym.pname->c.str.chars)),lisppkg),cons(ctx,scb,args));
+    vpop(); // pop agrs
   }
   ~EuslispSubscriptionCallbackHelper() {
       ROS_ERROR("subscription gc");

@Affonso-Gui
Copy link
Member Author

@YoheiKakiuchi I have previously attempted a similar approach in #636 and was kind of uncertain if it made any sense, but I guess that it actually is a valid solution for this error then!
With these changes I was able to run 40,000+ iterations without any problem.

@YoheiKakiuchi
Copy link
Member

YoheiKakiuchi commented May 20, 2021

Here is the correct place to fix this problem.

@@ -870,11 +872,12 @@ pointer ROSEUS_SUBSCRIBE(register context *ctx,int n,pointer *argv)
   fncallback = argv[2];
   args=NIL;
   for (int i=n-1;i>=3;i--) args=cons(ctx,argv[i],args);
-
+  vpush(args);
   EuslispMessage msg(message);
    boost::shared_ptr<SubscriptionCallbackHelper> *callback =
      (new boost::shared_ptr<SubscriptionCallbackHelper>
       (new EuslispSubscriptionCallbackHelper(fncallback, args, message)));
+  vpop();
   SubscribeOptions so(topicname, queuesize, msg.__getMD5Sum(), msg.__getDataType());
   so.helper = *callback;
   Subscriber subscriber = lnode->subscribe(so);

'args' is created on source of C-lang, so 'args' is not protected anywhere.
Then 'args' may be collected as a garbege.
( objects in argv, such as 'sub' in this case, are not collected )
fncallback may also be protected because it is arguments of function.

@Affonso-Gui
Copy link
Member Author

This fix also seems stable!

I see, so it is not exactly the object instance being collected but the list as a whole, right?
I'll do some more testing tomorrow and (re)organize a PR if everything looks fine.

Also any insights if we should dump the lisp symbol on unsubscriptions?
Registering on subscriptions and never unregistering them does seem like a memory leak...

@YoheiKakiuchi
Copy link
Member

Yes, list(cons) object is collected. objects in list are safe.

Can we delete symbol ? I think reuse of symbol is realistic.
Repeating subscribe many times is non sense, so it's not really a problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants