
Conversation

@mxgrey
Contributor

@mxgrey mxgrey commented Nov 19, 2025

This PR fixes #448

I found two unrelated problems that risked causing deadlocks in mutex usage:

1. Premature release of a mutex after replanning

The MoveRobot phase is set up to automatically release mutexes that are no longer needed by the remaining movement that the robot is performing. In general this works fine since a robot should only need to retain mutexes that it is currently moving through.

However, there is an edge case: if replanning is forced by a negotiation, the new plan may begin with a step that commands the robot to the midpoint of a lane. In the current implementation, this midpoint is not explicitly associated with any part of the graph, i.e. it has no "graph elements" to reference when determining which mutexes should remain locked. This leaves the MoveRobot phase unaware that it needs to retain the mutex of the lane the robot is currently on, or the mutex of the vertex the robot is moving towards. Without that awareness, the MoveRobot phase will naturally release those mutexes.

In this PR, we now detect this edge case directly and block the MoveRobot phase from doing any automatic release of mutexes while it is handling this type of midpoint. This is a very blunt-force approach to solving the problem, but I don't think we'll find a better one until we reimplement the traffic system to do geometric reasoning instead of pure graph vertex/edge reasoning.
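To make the edge case concrete, here is an illustrative sketch (not the actual MoveRobot implementation; the `Waypoint` struct and `mutexes_to_release` function are hypothetical simplifications) of why a detached midpoint must suppress automatic release: a waypoint with no graph elements reports no mutex requirements, so the only safe choice is to keep everything that is currently held.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical simplification of the MoveRobot release logic.
struct Waypoint
{
  // Graph indices this waypoint maps onto; empty for a detached midpoint
  // commanded after a forced replan.
  std::vector<std::size_t> graph_indices;
  std::set<std::string> required_mutexes;
};

std::set<std::string> mutexes_to_release(
  const std::set<std::string>& held,
  const std::vector<Waypoint>& remaining)
{
  std::set<std::string> still_needed;
  for (const auto& wp : remaining)
  {
    // The edge case: a midpoint with no graph elements gives us no
    // information about which mutexes are needed, so conservatively
    // release nothing at all.
    if (wp.graph_indices.empty())
      return {};
    still_needed.insert(
      wp.required_mutexes.begin(), wp.required_mutexes.end());
  }

  std::set<std::string> release;
  for (const auto& m : held)
  {
    if (still_needed.count(m) == 0)
      release.insert(m);
  }
  return release;
}
```

Without the empty-`graph_indices` check, the midpoint would contribute nothing to `still_needed` and every held mutex would be released, which is exactly the premature release described above.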

2. Incorrect cleanup of current_mutex_groups in ExecutePlan::make

The logic when creating the plan execution is terribly convoluted because of the need to iterate through waypoints while only inserting events as needed. The variable current_mutex_groups is meant to track which mutex groups need to be locked before the next phase can be started, but its value was not getting reset to nullopt after being used in certain cases. This meant that the plan would involve repeatedly trying to relock mutexes that had just been correctly released.

The fix proposed in this comment does have the effect of forcing current_mutex_groups to be reset, which is why it appeared to resolve this part of the deadlocking problem. However it also has a side-effect of forcing a LockMutexGroup phase in between every waypoint in the plan, which is why I did not go with that proposed solution.
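A minimal sketch of the bookkeeping bug (the `build_phases` function and phase names here are hypothetical stand-ins, not the real `ExecutePlan::make` code): the pending mutex groups must be consumed once a locking phase has been emitted, otherwise later iterations keep re-emitting locks for groups that were already handled and then released.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical sketch of the ExecutePlan::make bookkeeping.
// Each waypoint may introduce a mutex group that must be locked before
// the next movement phase starts.
std::vector<std::string> build_phases(
  const std::vector<std::optional<std::string>>& waypoint_mutex_groups)
{
  std::vector<std::string> phases;
  std::optional<std::string> current_mutex_groups;
  for (const auto& group : waypoint_mutex_groups)
  {
    if (group.has_value())
      current_mutex_groups = group;

    if (current_mutex_groups.has_value())
    {
      phases.push_back("LockMutexGroup(" + *current_mutex_groups + ")");
      // The fix: reset after use. Without this line, every subsequent
      // waypoint would re-emit a lock for a group that was already
      // acquired and possibly released in the meantime.
      current_mutex_groups = std::nullopt;
    }
    phases.push_back("MoveRobot");
  }
  return phases;
}
```

With the reset in place, a group is locked exactly once where it is first needed; forcing a reset on every iteration instead (the rejected alternative) would insert a LockMutexGroup phase between every pair of waypoints.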

Other notes

Besides fixing the above problems, I've left some debug logging in this PR with the information that helped me resolve this issue. If we notice anything else suspicious about the mutex behavior in the future, those debug outputs should help resolve it a bit faster.

Signed-off-by: Michael X. Grey <[email protected]>
@xiyuoh
Member

xiyuoh commented Nov 21, 2025

I took a look at the code changes, and it seems reasonable to add an additional filter to make sure we account for robots currently on a lane during replanning.

I'm testing this PR out. It ran without deadlock for some time, but after tweaking the graph a little (not the mutexes; I just spaced the charging stations at the bottom further apart), I can consistently reproduce a deadlock where TinyRobot_4 starts waiting for mutex_3 after a replan.

[Screenshot from 2025-11-21 15-25-11]

Relevant logs are available here, and I'm copying the suspicious lines below, which suggest that TinyRobot_4 should be moving up towards loc_6 instead of moving down towards the charger:

[fleet_adapter-19] [INFO] [1763709514.899506501] [TinyRobot_fleet_adapter]: Replanning for [TinyRobot/TinyRobot_4] after locking mutexes [mutex_4][mutex_5] because the recommended plan has changed from [loc_8][loc_10][#21][#24][#25][TinyRobot_4_charger] to [loc_8][loc_10][#21][#7][#23][#24][#25][TinyRobot_4_charger]
[fleet_adapter-19] [INFO] [1763709514.899614731] [TinyRobot_fleet_adapter]: Replanning requested for [TinyRobot/TinyRobot_4]
[fleet_adapter-19] [INFO] [1763709514.899701921] [TinyRobot_fleet_adapter]: Planning for [TinyRobot/TinyRobot_4] to [TinyRobot_4_charger] from one of these locations:
[fleet_adapter-19]  -- lane 8: { L1 <7.41765 2.27364> [loc_6] [mutex: mutex_3] } -> { L1 <  7.41765 -0.165181> [loc_8] [mutex: mutex_4] } [mutex: mutex_3] | location <  7.40485 -0.165181> | orientation -3.73284e-05
[fleet_adapter-19]  -- lane 9: { L1 <  7.41765 -0.165181> [loc_8] [mutex: mutex_4] } -> { L1 <7.41765 2.27364> [loc_6] [mutex: mutex_3] } [mutex: mutex_3] | location <  7.40485 -0.165181> | orientation -3.73284e-05
[fleet_adapter-19]  -- L1 <  7.41765 -0.165181> [loc_8] [mutex: mutex_4] | location <  7.40485 -0.165181> | orientation -3.73284e-05
[fleet_adapter-19]  -- lane 54: { L1 <  7.41765 -0.165181> [loc_8] [mutex: mutex_4] } -> { L1 <   0.8548 -0.165181> [loc_7] [mutex: mutex_4] } [mutex: mutex_4] | location <  7.40485 -0.165181> | orientation -3.73284e-05
[fleet_adapter-19]  -- lane 55: { L1 <   0.8548 -0.165181> [loc_7] [mutex: mutex_4] } -> { L1 <  7.41765 -0.165181> [loc_8] [mutex: mutex_4] } [mutex: mutex_4] | location <  7.40485 -0.165181> | orientation -3.73284e-05

For convenience I pushed my test case maps/configs to this xiyu/test_mutex branch, but it requires some setup to use them. The launch command is:

ros2 launch rmf_site_demos test_mutex.launch.xml

and I've also added charger.sh and send_robots_back.sh, which send patrol tasks dispatching all 5 robots to the chargers and back to their start locations respectively.

I'm not entirely sure whether this is a map setup problem, because it had worked successfully when I drew the charger waypoints differently (closer together, with a straight lane connecting all leaf nodes towards the chargers), but the replanning behavior seems a little weird to me.

@mxgrey
Contributor Author

mxgrey commented Nov 24, 2025

@xiyuoh thanks for finding this additional case. I think it's worth opening a PR for this map you created in rmf_site_ros2 so we have this as a test case that we can refer back to.

I believe what's happening in the deadlock case you've found is that TinyRobot_4 is within the "merge lane" distance of the { loc_6 -> loc_8 } lane, making its current position on that lane an "acceptable" starting point for the planner. Geometrically the cost calculation is no worse whether the robot starts "on" that lane or "on" the loc_8 waypoint, since it's still the same point in space. The planner ends up arbitrarily choosing between those two semantically different starting points because the apparent cost is the same for both. Unfortunately the { loc_6 -> loc_8 } lane carries the constraint that mutex_3 must be locked, and that constraint is not accounted for in the cost calculation or anywhere else in the planner.
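The tie-break problem above can be sketched in a few lines (this is an illustrative toy, not the rmf planner's actual start-selection code; `StartCandidate` and `pick_start` are hypothetical): when two start candidates have identical cost, ordering alone decides which one wins, and mutex constraints never enter the comparison.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical start candidate: same physical point in space can be
// described as "on a lane" or "on a waypoint", with different mutex
// requirements attached.
struct StartCandidate
{
  std::string description;
  double cost;
  std::optional<std::string> required_mutex;
};

const StartCandidate& pick_start(const std::vector<StartCandidate>& starts)
{
  // min_element returns the *first* minimum, so the ordering of
  // equal-cost candidates decides the outcome; required_mutex plays
  // no role in the comparison, mirroring the planner behavior
  // described above.
  return *std::min_element(starts.begin(), starts.end(),
    [](const StartCandidate& a, const StartCandidate& b)
    {
      return a.cost < b.cost;
    });
}
```

In the deadlock scenario, the "on lane" candidate happens to win the tie, silently committing the robot to a lane whose mutex_3 requirement was never weighed.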

The only solution I can think of for this is that LockMutexGroup needs to forcibly override the start points used by the planner for as long as it's active. I'll try that out and see if it resolves this scenario.



Development

Successfully merging this pull request may close these issues.

[Bug]: Title: Unexpected deadlock caused by Mutex
