Skip to content

Conversation

@DanielG
Copy link
Contributor

@DanielG DanielG commented Feb 2, 2021

Currently when a usb device is remove the libusb event thread, though the
hotplug handler, will cause the widget's Device to be deleted in the main
thread and wait for this to complete in the libusb thread.

If any of the destructors triggered by device deletion causes libusb_close
to be called this causes a deadlock as libusb_close will try to take over
handling of events in the main thread while the libusb thread is waiting
for libusb_close, via the destructors, to finish.

Just move both widget and device deletion to the main thread to prevent
this scenario.

@peternewman peternewman added this to the 0.11.0 milestone Feb 8, 2021
Copy link
Member

@peternewman peternewman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fairly sensible to me, but I don't fully understand the finer detail of why it was done in the first place. Have you seen this issue in your code then?

It was done in #930 , specifically 70ec24f .

@nomis52 may be able to comment further if they see this.

@peternewman
Copy link
Member

@kripton you might be interested in this to avoid merge conflicts with your changes, and you may have a better idea of the code than me!

Copy link
Contributor Author

@DanielG DanielG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review!

Have you seen this issue in your code then?

Indeed I ran into this deadlock in our out-of-tree code. AFAICS there isn't an in-tree user that is currently triggering it because no one is crazy enough to have the usb tranciever (threads) be part of the (derrived) Port class, but there's nothing preventing it from happening in the future.

The approximate backtrace for the main thead (from memory) is something like this:

libusb_close()
....
LibUsbAdaptor::Close()
AsyncUsbSender::~AsyncUsbSender()
MyAsyncUsbSender::~MyAsyncUsbSender()
MyPort::~MyPort()
Device::DeleteAllPorts()
~Device::Stop()
AsyncPluginImpl::ShutdownDevice()
...

while the libusb thread is stuck waiting for the future f.Get() call in AsyncPluginImpl::DeviceEvent.

For details on the libusb_close behavour see the libusb_close definition libusb/core.c, it has this:

	/* Similarly to libusb_open(), we want to interrupt all event handlers
	 * at this point. More importantly, we want to perform the actual close of
	 * the device while holding the event handling lock (preventing any other
	 * thread from doing event handling) because we will be removing a file
	 * descriptor from the polling loop. If this is being called by the current
	 * event handler, we can bypass the interruption code because we already
	 * hold the event handling lock. */

	if (!handling_events) {
                [...]
		/* take event handling lock */
		libusb_lock_events(ctx);
	}

and IIRC libusb_lock_events() is where things deadlock.

If you'd like to see the details I could probably post the relevant parts of our code somewhere, it's just not really ready for inclusion is any way shape or form :)

@kripton
Copy link
Member

kripton commented Feb 8, 2021

@peternewman, thanks for pinging me here. I can also follow the reasoning of the change but just as you, I wouldn't know if it breaks something. I agree that @nomis52 might be able to comment on it. However since I assume it worked before #930 and nomis52 might just not have been aware of the problem, this could just be merged.
About merge conflicts with the changes I did in my DMX16 branch: true, that would be affected. Currently, I don't plan to create a PR for them since I'm not quite happy with the dongles performance. There are some options pending (like a USB serial based approach instead of USB HID) to be played around with :) Detailed reasoning will follow on the mailing list topic tomorrow.

@kripton
Copy link
Member

kripton commented Feb 8, 2021

Oh, and I don't have any USB DMX dongle to actually check the code changes ;)
The RP2040-based dongle I made those changes for only works when it is plugged in before olad is started. So the unplugging and hotplug cases are hard to test using that dongle.
Most probable reason (a bit off-topic): The RP2040 dongle has the HID and CDC ACM interface. When libusb sees the device on hotplug, it cannot read the serial number since the kernel is about to configure the ACM interface. libusb tells me that the device is busy ...

Oh, just thinkig about it: I have one old Anyma uDMX around that I can test the code with :D

@DanielG
Copy link
Contributor Author

DanielG commented Feb 8, 2021

When libusb sees the device on hotplug, it cannot read the serial number since the kernel is about to configure the ACM interface. libusb tells me that the device is busy ...

I think you want to be calling adaptor->DetachKernelDriver(usb_handle, <USB iface num>) before trying to access that stuff, see DMXCProjectsNodleU1.cpp for an example.

@DanielG
Copy link
Contributor Author

DanielG commented Feb 8, 2021

I did test this patch together with our out-of-tree widget support code and it works just fine. I don't have any other supported usbdmx box either though, just some usbpro ones.

@DanielG
Copy link
Contributor Author

DanielG commented Feb 8, 2021

@kripton I just found your branch that Peter was talking about (I think): master...kripton:RP2040UsbDmx. I would ask you to consider submitting that as a PR if only so I can do some review. I also have some muti-universe support for usbdmx in my (private) patch queue using a different approach and I think we might want to hammer out a way to do this that works for both of us.

@kripton
Copy link
Member

kripton commented Feb 9, 2021

@DanielG: I created a Draft-PR as #1713. I'm curious to see what your solution looks like 👍

@peternewman
Copy link
Member

Indeed I ran into this deadlock in our out-of-tree code. AFAICS there isn't an in-tree user that is currently triggering it because no one is crazy enough to have the usb tranciever (threads) be part of the (derrived) Port class, but there's nothing preventing it from happening in the future.

If you'd like to see the details I could probably post the relevant parts of our code somewhere, it's just not really ready for inclusion is any way shape or form :)

No need for that, if you've hit the issue, made the change and its fixed it, that's good enough for me.

Copy link
Member

@peternewman peternewman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So from my perspective, LGTM, if we could fix those two minor issues then I'm happy to merge.

Logistics wise, I know #1627 slipped through the net before, is there any benefit to that going in before or afterwards?

@DanielG
Copy link
Contributor Author

DanielG commented Feb 13, 2021

So from my perspective, LGTM, if we could fix those two minor issues then I'm happy to merge.

Awesome, I'll try to get a new revision out by next week.

Logistics wise, I know #1627 slipped through the net before

No worries, I never had time to fix the CI failures there but I'll get that cleaned up.

FYI @kripton you might want to have a look at that PR since your branch has a vaguely similar multi-transfer mechanism.

is there any benefit to that going in before or afterwards?

I don't think the order makes any difference.

@peternewman
Copy link
Member

Logistics wise, I know #1627 slipped through the net before

No worries, I never had time to fix the CI failures there but I'll get that cleaned up.

Looks like it may just be a space around a comment or something trivial!

Copy link
Member

@peternewman peternewman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM within my knowledge of the code @DanielG

Happy to merge given the tests before and afterwards that you've done. Would you mind resolving the two outstanding nits please.

@peternewman peternewman assigned DanielG and unassigned peternewman Mar 3, 2021
@DanielG DanielG force-pushed the usbdmx-widget-deletion branch 2 times, most recently from 9e478b5 to 5aeaf22 Compare November 28, 2021 18:36
@DanielG
Copy link
Contributor Author

DanielG commented Nov 28, 2021

@peternewman This is ready for another round of review, also the CI seems to need approval: "First-time contributors need a maintainer to approve running workflows." despite this not being my first rodeo. Weird.

Copy link
Member

@peternewman peternewman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry just one more minor style nit please.

Currently when a usb device is remove the libusb event thread, though the
hotplug handler, will cause the widget's Device to be deleted in the main
thread and wait for this to complete in the libusb thread.

If any of the destructors triggered by device deletion causes libusb_close
to be called this causes a deadlock as libusb_close will try to take over
handling of events in the main thread while the libusb thread is waiting
for libusb_close, via the destructors, to finish.

Just move both widget and device deletion to the main thread to prevent
this scenario.
@DanielG DanielG force-pushed the usbdmx-widget-deletion branch from 5aeaf22 to 3faa662 Compare November 28, 2021 23:28
@DanielG
Copy link
Contributor Author

DanielG commented Nov 28, 2021

No worries, thanks for the quick review :)

Copy link
Member

@peternewman peternewman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @DanielG !

I assume you've tested the latest iteration?

It would be nice for backwards compatibility if we could keep the old method too, but I'm guessing we can't do that without risking causing the issue we're trying to fix!

@DanielG
Copy link
Contributor Author

DanielG commented Dec 6, 2021

Just to be clear: I have not tested the post-review version. I think we only made cosmetic changes though so I wouldn't expect any breakage having been introduced. Unfortunately my test bed for this is torn down ATM. So if you want it tested again it'll have to sit around until I get around to the project this came from again.

As for your backwards compat point, yeah, the whole point is to prevent a deadlock so I don't see how we can do that while keeping DeleteWidget in the libusb thread.

@peternewman peternewman merged commit 31bd61c into OpenLightingProject:master Dec 6, 2021
@peternewman
Copy link
Member

Just to be clear: I have not tested the post-review version. I think we only made cosmetic changes though so I wouldn't expect any breakage having been introduced.

Yet another reason to hate force pushing! :)

Unfortunately my test bed for this is torn down ATM. So if you want it tested again it'll have to sit around until I get around to the project this came from again.

Okay I've merged it anyway, hopefully myself or someone will notice fairly sharpish if this breaks things!

As for your backwards compat point, yeah, the whole point is to prevent a deadlock so I don't see how we can do that while keeping DeleteWidget in the libusb thread.

Yep, makes sense, just thought I should check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants