free(): invalid next size (fast) #142
Comments
Is there a core dump file generated? Would you please paste the call stack here so we can identify the root cause? Thanks!
How's this?
Thanks for the detailed information. The crash happened while the GC was executing, so I think the cause is complicated. Can it be reproduced reliably? (There are many dependencies, so we cannot find a direct connection from the backtrace.)
It seems to be fairly random. I could only reproduce it once yesterday in about 6 hours of testing, while this morning it happened on the first run of the software.
Happened again, with a different backtrace this time...
And another. Not sure why I'm breaking this in multiple places; there must be something else going on.
It looks weird that this happens not only when the GC executes, but also in other cases (within libraries of rclnodejs or librmw_fastrtps_shared_cpp.so). Which releases of rclnodejs and ROS 2 do you use (the latest ones are ROS 2 Eloquent Elusor - Patch Release 1 and 0.13.0), and which Node.js version? If your app is small, you could convert it into JS and run it by
Versions: The app's pretty large, but I could try a minimal rclnodejs example that consumes the higher-throughput topics that seem to have aggravated the issue.
I see, so I think you are running ROS 2 with the Docker image. We have tested the stability of rclnodejs before, e.g. publishing and subscribing to some topics for several days, but not with the Docker image. I agree that a simple test case would help locate the potential causes. Thanks for your support!
An update: to mimic your case, I ran an rclnodejs example which publishes a topic of type
Publish 161396 messages.
Publish 161397 messages.
Publish 161398 messages.
Publish 161399 messages.
The memory error has not happened after two days of running.
Still trying to investigate this. I'm still getting periodic crashes from the application, but I haven't gotten any crashes in my attempt to reproduce the problem. I'm testing
Any advice on how to install rclnodejs with debug symbols? I'm not super familiar with Node.js, but node-gyp keeps building it in release, even when I try to
Still working on debugging this. I set the flag
Still working on getting rclnodejs built with debug symbols so I can get more info about frame #6.
Thanks for continuing to investigate this issue. It seems that the crash happens when executing
Finally got rclnodejs building in debug. Generated this backtrace:
Looks like it was raised from the copy constructor called by https://github.com/RobotWebTools/rclnodejs/blob/1fc4e0f0ba05745e1dbe112d858d96b54d83d6b8/src/handle_manager.hpp#L54
ETA: Not sure if this provides any additional information, but I did some digging into the ready_handles_ member:
After testing a few more times, it looks like the memory errors are always with the ready_handles_ vector. I've seen both copy constructor and destructor failures. It looks like it's always fine in the get_ready_handles() frame, but some time after that it enters std::vector's copy constructor or destructor, and there the vector is junk. The vector itself is copied (so it doesn't reference the member variable), so the ready_handles_ member is fine. The copied vector is passed into ShadowNode::Execute by const reference. Is something about the async methods in ShadowNode::Execute somehow corrupting the vector? I'm not sure how the async stuff works, but could it enter the async methods with a valid vector that has been deallocated by the time it leaves them?
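To make the lifetime question above concrete, here is a minimal sketch of the hazard and one way to avoid it. It is not the rclnodejs code: `Handle`, `SumIdsAsync`, and `SumIdsFromTemporary` are hypothetical names, and `std::async` only stands in for whatever async mechanism ShadowNode::Execute uses. The idea is that a task which outlives its caller must own its data rather than hold a reference to the caller's copy.

```cpp
#include <future>
#include <utility>
#include <vector>

// Hypothetical stand-in for a handle; not the rclnodejs type.
struct Handle {
  int id;
};

// Runs work on another thread. The snapshot is moved into the lambda, so the
// task owns its data and the caller's vector may be destroyed at any time.
// Capturing by reference instead ([&snapshot]) would leave the task reading
// freed memory if the referenced vector went out of scope first.
std::future<int> SumIdsAsync(std::vector<Handle> snapshot) {
  return std::async(std::launch::async, [owned = std::move(snapshot)]() {
    int sum = 0;
    for (const Handle& h : owned) sum += h.id;
    return sum;
  });
}

// Starts the task from a scope that ends before the result is collected.
int SumIdsFromTemporary() {
  std::future<int> result;
  {
    std::vector<Handle> ready = {{1}, {2}, {3}};
    result = SumIdsAsync(ready);  // copies `ready` into the task
  }  // `ready` is destroyed here; the task still owns its own copy
  return result.get();
}
```

Calling `SumIdsFromTemporary()` stays well defined even though `ready` is gone before `get()` returns, because the task never touches the caller's vector.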
Thanks for logging the useful information. It seems we got a different backtrace from the previous ones. I noticed the following about the first element of the vector: children_ = std::set with 51029744 elements<error reading variable: Cannot access memory at address 0xffb81274e5894865> So I think memory corruption may be what causes the crash.
Just to close this out - we wrapped up this project before I got to do more investigation on this. One additional thing I noticed was that this error rarely (but still occasionally) happened on a server with much better hardware than my development laptop. There very well could have been unoptimized web client code that was causing this app to use too many resources, or something. Sorry we couldn't figure this out!
That's fine, thanks for everything you have done to help investigate the root cause of the problem. Actually, based on the call stack you provided, I found a potential race condition when data is written and read on different threads, and a patch has landed. Hopefully we will have a patch release of the Node.js client (rclnodejs) soon. You could give it a try then if you are still interested :D Meanwhile, I want to reopen this issue because others may hit the same problem, and they can give feedback through this thread or confirm whether the issue is still reproducible. Thanks a lot for your time and effort!
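For readers who hit the same symptom, this is a generic sketch of the kind of race described above and a common way to close it. It does not reproduce the patch that landed: `ReadyList`, `Add`, `Snapshot`, and `FillConcurrently` are hypothetical names chosen for illustration. One thread appends to a vector while another reads it, and every access goes through the same mutex.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a shared list of ready handles; not the
// rclnodejs class.
class ReadyList {
 public:
  // Called from the writer thread.
  void Add(int handle_id) {
    std::lock_guard<std::mutex> lock(mutex_);
    ready_.push_back(handle_id);
  }

  // Called from the reader thread. Returns a copy taken under the lock, so
  // the caller never iterates a vector that another thread is resizing.
  std::vector<int> Snapshot() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return ready_;
  }

 private:
  mutable std::mutex mutex_;
  std::vector<int> ready_;
};

// Writes `count` ids from one thread while the calling thread takes snapshots.
std::size_t FillConcurrently(ReadyList& list, int count) {
  std::thread writer([&list, count] {
    for (int i = 0; i < count; ++i) list.Add(i);
  });
  for (int i = 0; i < 100; ++i) {
    (void)list.Snapshot();  // safe to call while the writer is running
  }
  writer.join();
  return list.Snapshot().size();
}
```

Without the lock, a `push_back` that reallocates while another thread copies the vector can corrupt the heap, which is one way to end up with allocator errors like the one in the title.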
I'm getting a free(): invalid next size (fast) error in my implementation. I'm using ros2-web-bridge to interface a web page with my ROS2 backend, where the web page subscribes to about 10 topics. The fastest topics are published at 10Hz. None of the msgs are particularly large (no images or point clouds or anything), the biggest being about 6 topics published at 10Hz with 60-70 geometry_msgs/Points.
The full error msg isn't very insightful - if you have recommendations on more verbose debugging, I'll give it a shot.