Fix potential deadlocks from execution of mutex lock and release #490
Conversation
I took a look at the code changes, and it seems reasonable to add an additional filter to make sure that we account for robots currently on a lane during replanning. I'm testing this PR out, and though it was working great without deadlock for some time, after tweaking the graph a little (not the mutexes, just spacing the charging stations at the bottom further apart), I can consistently create a deadlock where TinyRobot4 starts waiting for

Relevant logs are available here, and I'm copying the suspicious lines below, where they suggest that TinyRobot4 should be moving up towards

For convenience I pushed my test case maps/configs to this xiyu/test_mutex branch, but it requires some setup to use them. The launch command is:

and I've also added

I'm not entirely sure if this is a map setup problem, because it had worked successfully when I drew the charger waypoints differently (closer together, with a straight lane connecting all leaf nodes towards the chargers), but the replanning portion is a little weird to me.
@xiyuoh thanks for finding this additional case. I think it's worth opening a PR for this map you created in

I believe what's happening in the deadlock case that you've found is

The only solution I can think of for this is that

This PR fixes #448
I found two unrelated problems which had a risk of causing deadlocks in mutex usage:
1. Premature release of a mutex after replanning

The `MoveRobot` phase is set up to automatically release mutexes that are no longer needed by the remaining movement that the robot is performing. In general this works fine, since a robot should only need to retain mutexes for the parts of the graph it is still moving through.

However, there is an edge case: if replanning gets forced by a negotiation, the new plan may involve one initial step where the robot is commanded to the midpoint of a lane. In the current implementation, this midpoint is not explicitly associated with any part of the graph, i.e. it has no "graph elements" to reference when determining which mutexes should be locked. This leaves the `MoveRobot` phase unaware that it needs to retain the mutex of the lane that the robot is currently on, or the mutex of any vertex that the robot is moving towards. Without that awareness, the `MoveRobot` phase will naturally release those mutexes.

In this PR, we now detect this edge case directly and block the `MoveRobot` phase from doing any automatic release of mutexes while it is handling this type of midpoint. This is a very blunt-force approach to solving the problem, but I don't think we'll find a better approach until we reimplement the traffic system to do geometric reasoning instead of pure graph vertex/edge reasoning.
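To make the shape of this fix concrete, here is a minimal C++ sketch of the guard described above. This is not the actual rmf_fleet_adapter code; the type and function names (`PlanWaypoint`, `has_graph_elements`, `maybe_release_mutexes`) are hypothetical stand-ins:

```cpp
#include <cstddef>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for one step of the robot's remaining plan.
struct PlanWaypoint
{
  // Both empty/unset when the step is a bare lane midpoint produced by a
  // forced replan, i.e. it references no graph vertex or lane.
  std::vector<std::size_t> graph_lanes;
  std::optional<std::size_t> graph_waypoint;

  bool has_graph_elements() const
  {
    return graph_waypoint.has_value() || !graph_lanes.empty();
  }
};

// Sketch of the automatic-release step inside a MoveRobot-like phase.
// `still_needed` would be derived from the graph elements of the remaining
// waypoints; how that set is computed is omitted here.
void maybe_release_mutexes(
  const std::vector<PlanWaypoint>& remaining_waypoints,
  const std::set<std::string>& still_needed,
  std::set<std::string>& held_mutex_groups)
{
  // Edge case from the description above: if the next step is a lane
  // midpoint with no graph elements, we cannot tell which mutexes are still
  // required, so skip the automatic release entirely instead of guessing.
  if (!remaining_waypoints.empty() &&
    !remaining_waypoints.front().has_graph_elements())
  {
    return;
  }

  // Otherwise release every held mutex group that no remaining waypoint
  // still needs.
  for (auto it = held_mutex_groups.begin(); it != held_mutex_groups.end(); )
  {
    if (still_needed.count(*it) == 0)
      it = held_mutex_groups.erase(it);
    else
      ++it;
  }
}
```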
2. Incorrect cleanup of `current_mutex_groups` in `ExecutePlan::make`

The logic when creating the plan execution is terribly convoluted because of the need to iterate through waypoints while only inserting events as needed. The variable `current_mutex_groups` is meant to track which mutex groups need to be locked before the next phase can be started, but its value was not getting reset to `nullopt` after being used in certain cases. This meant that the plan would involve repeatedly trying to relock mutexes that had just been correctly released.

The fix proposed in this comment does have the effect of forcing `current_mutex_groups` to be reset, which is why it appeared to resolve this part of the deadlocking problem. However, it also has the side effect of forcing a `LockMutexGroup` phase in between every waypoint in the plan, which is why I did not go with that proposed solution.
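As a rough illustration of this second fix (again, not the real `ExecutePlan::make` code, whose structure is considerably more involved), the sketch below shows the intended lifecycle of a `current_mutex_groups`-style tracker while iterating over waypoints: once the pending groups are turned into a lock phase, the tracker is reset to `nullopt` so later waypoints do not keep relocking groups that were already handled. The `Waypoint` and `Phase` types and their field names are hypothetical:

```cpp
#include <optional>
#include <set>
#include <string>
#include <vector>

struct Waypoint
{
  // Mutex groups that must be held before the robot may proceed to this
  // waypoint (hypothetical field name).
  std::set<std::string> required_mutex_groups;
};

struct Phase { std::string description; };

std::vector<Phase> build_phases(const std::vector<Waypoint>& waypoints)
{
  std::vector<Phase> phases;

  // Tracks mutex groups that still need to be locked before the next
  // movement phase can start.
  std::optional<std::set<std::string>> current_mutex_groups;

  for (const auto& wp : waypoints)
  {
    if (!wp.required_mutex_groups.empty())
    {
      if (!current_mutex_groups.has_value())
        current_mutex_groups.emplace();
      current_mutex_groups->insert(
        wp.required_mutex_groups.begin(), wp.required_mutex_groups.end());
    }

    if (current_mutex_groups.has_value())
    {
      // Insert a lock phase only when there are pending groups...
      phases.push_back(Phase{"LockMutexGroup"});
      // ...and reset the tracker afterwards. Forgetting this reset is the
      // kind of bug described above: later iterations would keep inserting
      // lock phases for groups that were already locked and released.
      current_mutex_groups = std::nullopt;
    }

    phases.push_back(Phase{"MoveRobot"});
  }

  return phases;
}
```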
Other notes

Besides fixing the above problems, I've left some debug logging in this PR with the information that helped me resolve this issue. If we notice anything else suspicious about the mutex behavior in the future, those debug outputs should help resolve it a bit faster.
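For anyone digging into similar issues, debug output along these lines can be emitted from a ROS 2 node; this is a generic sketch of the idea, not the exact log statements added in this PR:

```cpp
#include <rclcpp/rclcpp.hpp>
#include <string>

// Generic example of emitting mutex-related debug logs in ROS 2.
void log_mutex_release(
  const rclcpp::Logger& logger,
  const std::string& robot_name,
  const std::string& mutex_group)
{
  RCLCPP_DEBUG(
    logger,
    "[%s] releasing mutex group [%s]",
    robot_name.c_str(),
    mutex_group.c_str());
}
```

Debug-severity output can then be enabled at runtime, e.g. by launching with `--ros-args --log-level debug`.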