fix(discovery-client): allow discovery client to track network interfaces #7608
Conversation
Rather than rely on its caller, this change allows the Discovery Client to check the system's network interfaces to determine whether its mDNS browser needs to be restarted to bind on any newly added interfaces (or to flush out any removed interfaces). This change is necessary to increase Discovery Client reliability when robots connected to U2E adapters are plugged into and unplugged from the user's computer.
- The basic idea of this PR—move responsibility of keeping track of interfaces to inside the discovery client—sounds great to me.
- I went through the basic test procedure. What I saw matched what you described.
- I've only very briefly looked at the code, so far.

> I've added app logging to record when these restarts happen, so we should carefully study these logs and, if possible, stress test this change somehow.
How will we do this? I thought only robot logs get forwarded to us over the Internet?
```js
    originalServiceTypeFromString.apply(this, args)
  } catch (error) {
    console.warn(error)
```

```js
  if (!interfacesMatch) {
```
Is it right to use the set of interface names changing across polls as the criterion for restarting the mDNS browser, like this?
For example, this sequence:

1. Poll the OS's current set of network interface names.
2. An interface goes down.
3. That interface comes back up.
4. Again poll the OS's current set of network interface names.

The set of network interface names doesn't change between steps 1 and 4. But the interface still goes down and comes back up.
I guess this hinges on what it really means for the mDNS browser to "bind to network interfaces when it's created." What's it latching onto? An IP? An OS handle to a hardware device? Is the handle invalidated when the hardware device is removed?
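As a reference point, here's a minimal sketch of that criterion, assuming a simple name-set comparison on a polling interval (the variable names, interval, and restart log here are illustrative, not the PR's actual code). The blind spot is visible: a down/up cycle that completes entirely between two polls produces identical name sets.

```js
import os from 'os'

// Poll the OS for interface names and compare against the previous poll.
// A down/up cycle that happens entirely between two polls leaves the name
// set unchanged, so no restart would be triggered by this check alone.
let previousNames = new Set(Object.keys(os.networkInterfaces()))

setInterval(() => {
  const currentNames = new Set(Object.keys(os.networkInterfaces()))
  const missing = [...previousNames].filter(name => !currentNames.has(name))
  const extra = [...currentNames].filter(name => !previousNames.has(name))

  if (missing.length > 0 || extra.length > 0) {
    // placeholder for restarting the mDNS browser
    console.log('interface set changed, restarting mDNS browser', { extra, missing })
  }

  previousNames = currentNames
}, 5000)
```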
"network interfaces" are logical interfaces to hardware, with many-to-many bindings to actual hardware and occasionally fully logical implementations - consider a loopback iface, or a bridge iface that really only exists to route packets between two other ifaces. Depending on the os, names may be unique plug-to-plug; more often, they won't be. interfaces
is a wrapper around node's os.networkInterfaces in particular, which appears to elide any os-provided unique ids for ifaces even if they do happen to exist. Looks like the implementation boils down to getifaddrs(3) on linux and similar on other platforms, so it'll get any iface with a bound address (and no others, which probably isn't a problem for us because the robot won't send mdns packets over an iface with no address anyway).
meanwhile, the problem we're trying to get around is
- mdns-js will open a socket on the appropriate port
- if the iface goes away, the socket will go away (and it only maybe gets reported)
- even if it does get reported, we're presumably not really detecting the problem
- so we kind of need to brute force it
This all adds up to: yeah, you could have a kind of problem with this implementation where the iface goes away and comes back all in between polls and you never detect the difference, but the socket has in fact gone away. In practice I think this may not actually happen because local IP namespaces take forever to establish; but it may be worth looking into whether there's an error we can hook that will actually tell us when the socket goes away, if it exists (which it may not).
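For illustration, this is roughly the data the poll has to work with; the interface names and addresses below are examples, not guaranteed values. Note there's no stable per-device identifier to key off of, only names and currently bound addresses.

```js
import os from 'os'

// os.networkInterfaces() only lists interfaces that currently have a bound
// address, keyed by name, e.g. (values are examples, not real output):
// {
//   lo0: [{ address: '127.0.0.1', family: 'IPv4', internal: true, ... }],
//   en5: [{ address: '169.254.23.42', family: 'IPv4', internal: false, ... }],
// }
for (const [name, addresses] of Object.entries(os.networkInterfaces())) {
  console.log(name, addresses.map(a => `${a.family} ${a.address}`).join(', '))
}
```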
So this is mostly problematic on link-local; I wonder if, in the absence of anything better, we could check the IP address of the interface, too? The computer's link-local address will likely change (I think?) on a U2E hotplug.
Looking into it a little bit more, we have access to the UDP socket, so we could (presumably) listen for the `error` and `close` events ourselves.
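A sketch of what that could look like, assuming we can get the underlying `dgram.Socket` instances out of `mdns-js` (how exactly is the open question); the function and callback names are made up:

```js
// Given the browser's underlying dgram sockets (however we end up getting
// them out of mdns-js), find out when one dies instead of silently losing
// mDNS reception.
function watchMdnsSockets(sockets, onSocketGone) {
  for (const socket of sockets) {
    socket.once('error', error => onSocketGone(error))
    socket.once('close', () => onSocketGone(new Error('mDNS socket closed')))
  }
}
```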
> The computer's link-local address will likely change (I think?) on a U2E hotplug
Anecdotally, I think I've seen the IP address not change (and been surprised by that). But if the IP address doesn't change, maybe that also suggests that the original binding wasn't invalidated, either? Hmm.
I think being able to get the udp socket's events will help. What we expect to happen is
- the network interface goes away
- any ongoing `select`s or whatever should return an EIO or whatever - that should bubble up through the node runtime to the events and callbacks
The normal danger with sockets is that you can't really detect when the other side stops listening especially on udp, but because what we're detecting is the local interface going away, I'd hope that we get an error. That said, it's absolutely in the "hope so!" category, and may be platform-dependent.
I've done some testing on macOS with our underlying mdns-js and the Node.js UDP sockets it exposes. Results are... not ideal.
- A Node.js UDP socket does not emit any sort of event on the socket itself if the interface goes away
- A given `send` will fail with `EADDRNOTAVAIL`, though (see the sketch after this list)
  - Unlike socket-level events, we do not have a convenient way to capture send errors from `mdns-js`
  - Additionally, the mDNS library really only `send`s once, for the initial discovery query at boot
- The race condition @sfoster1 described above is definitely possible and repeatable; if the interface goes down and comes back up within an interface-check poll interval, we basically lose the socket for receiving mDNS packets
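For reference, the send-error behavior above is how Node's `dgram` API works in general: a failure like `EADDRNOTAVAIL` shows up in the `send` callback (or as an `'error'` event if no callback is passed), not as a dedicated "interface went away" event. A standalone illustration, independent of `mdns-js`:

```js
import dgram from 'dgram'

const socket = dgram.createSocket({ type: 'udp4', reuseAddr: true })

// If the interface backing this socket has gone away, the error surfaces
// here in the send callback (e.g. EADDRNOTAVAIL), not as a socket event.
socket.send(Buffer.from('discovery query'), 5353, '224.0.0.251', error => {
  if (error != null) console.warn('mDNS send failed:', error.code)
})
```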
Based on this testing, I'm at a TL;DR that we'd have to write our own mDNS browser (possibly / probably not in Node.js) to actually cover our bases here. I don't want to do that (right now), so here are my thoughts:
- A 5-second polling interval for network interface changes is actually fairly effective, so I'll bring the interval down from 10 to 5
- Given the typical timing of interface bring-up on both ends of the U2E adapter, I have not successfully been able to sneak an iface up/down/up through this interval
- A polling interval in the range of ~1 min is definitely big enough to sneak through, though
- Our existing slow poll of network interfaces is 30 seconds, which feels big enough to sneak through
- If a robot reboots and comes back with the same IP address, we're fine because of the discovery client's HTTP health polling loop
- I think I will add IP address checking to the network interface comparison, just to cover an extra base if something sneaks past the 5 second poll interval
That leaves us with one case where we're in trouble:
- Robot reboots and self-assigns a new IP address
- PC self-assigns the same IP address
- The interface up/down/up on the computer-side sneaks past the 5-second check interval
In that case, we "lose" the robot. However, we only lose the robot until the user manually triggers a discovery re-check. Combined with the fact that full page restart spinners won't stick around forever (#7589), I think I'm willing to call that good enough until we can write our own mDNS browser to handle this properly
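To make the extra IP-address check mentioned above concrete, here's a minimal sketch of one way to do the comparison; the helper names are made up and this is not necessarily how the PR implements it. Reducing each interface to its name plus its currently assigned addresses means a replug that changes the link-local address gets caught even when the name set is identical.

```js
import os from 'os'

// Snapshot interfaces as "name:addr1,addr2" strings so both name changes and
// address changes show up as a mismatch between polls.
function getInterfaceSnapshot() {
  return Object.entries(os.networkInterfaces())
    .map(([name, addresses]) =>
      `${name}:${addresses.map(a => a.address).sort().join(',')}`
    )
    .sort()
}

function snapshotsMatch(previous, current) {
  return (
    previous.length === current.length &&
    previous.every((entry, i) => entry === current[i])
  )
}
```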
Agreed, I think that's good enough. I think we can test whether it's a node problem or a "that's how sockets work" problem that we don't get an error if we're listening on a socket on an iface that goes away, but I also think that's not super relevant.
Alright, this has been an... adventure. I think this is finally ready to go. Here are the latest updates:
With (1) and (2) combined, we get very reliable robot hotplugs. There's probably some timing tuning we could do, but as is, it will eventually pick up the robot. As an aside, it turns out this is a regression introduced by #6193 and removing the functionality called out in #5985. The "old" discovery client re-queried the mDNS browser on an interval for an entirely different purpose, inadvertently ensuring that the network interface checking added by #5933 actually solved the problem. When the re-query behavior was removed, the network interface checking that was in place was not sufficient to allow robots to be hotplugged.
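For the re-query update mentioned above (also discussed further down), here's a minimal sketch of an exponentially backed-off re-query loop; the delays and the `requery` callback are placeholders, not the PR's actual values:

```js
// Re-send the discovery query at increasing intervals (e.g. 4 s, 8 s, 16 s,
// up to a cap) so a robot that missed the initial query still gets found
// without flooding the network. Returns a function that cancels the loop.
function scheduleRequery(requery, initialDelayMs = 4000, maxDelayMs = 64000) {
  let delay = initialDelayMs
  let timer = setTimeout(function tick() {
    requery()
    delay = Math.min(delay * 2, maxDelayMs)
    timer = setTimeout(tick, delay)
  }, delay)

  return () => clearTimeout(timer)
}
```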
```js
// this method can throw (without emitting), so we need to patch this up
const originalServiceTypeFromString = ServiceType.prototype.fromString

ServiceType.prototype.fromString = function (...args) {
  try {
    originalServiceTypeFromString.apply(this, args)
  } catch (error) {
    console.warn(error)
  }
}
```
should we try and upstream this?
Yeah probably. Upstream has not been touched in a while and it's an old issue, but worth a shot
Nice! Looks good, hopefully this will be enough going forward.
This is mostly over my head now, but, one thing:
> Added an explicit re-query call on an (exponentially backed-off) interval
I don't understand why this is necessary. Even if the OT-2 misses the first query, shouldn't it automatically announce itself when it detects its network interface has come up?
> Whenever a Multicast DNS responder starts up, wakes up from sleep, receives an indication of a network interface "Link Change" event, or has any other reason to believe that its network connectivity may have changed in some relevant way, it MUST perform the two startup steps below: Probing (Section 8.1) and Announcing (Section 8.3).
QA approved for more thorough testing in release test suite
overview
Previously, #5933 added network interface checking to enable Discovery Client restarts when the computer's active network interface list changed. This was required because the mDNS browser only binds to the network interfaces that exist when it is created, so newly added interfaces are never picked up without a restart.

The implementation relied on the `app-shell` module to check the system's network interfaces and, if they changed, restart the entire discovery client. Either due to limitations in this implementation, or due to changes to the discovery client since then, this functionality either isn't working as well as we thought it was or has potentially broken entirely, as confirmed by real-world testing. It worked at all only because the Discovery Client was also periodically re-sending its discovery query on an interval, which ensured that once the DC picked up a new interface, it would (eventually) issue a discovery query on that interface. This was not an intentional effect of the re-query functionality, so when the re-query was removed per #5985 and #6193 added fairly aggressive discovery cache cleanup, robot rediscovery over the U2E interface became a lot less reliable.
This PR fixes the "hotplug" functionality, but on purpose this time. Related to #6585, in that this problem appears to be a root cause of the app never seeing an updated robot.
changelog
Rather than rely on its caller, this PR pulls the network interface check directly into the Discovery Client in order to:
It also:
This brings the following benefits:
Review requests
Basic test procedure with Discovery Client CLI:
```shell
make -C discovery-client && yarn discovery browse --logLevel debug
```
Basic test procedure with app:
```
debug: Restarting mDNS due to network interface mismatch { extra: [], missing: [ ... ] }
```
Risk assessment
Medium. There's a chance here that this mDNS browser restart strategy is overly aggressive in a way that could result in either excessive network load, missing valid advertisements due to frequent restarts, or both.
I've added app logging to record when these restarts happen, so we should carefully study these logs while testing and, if possible, stress test this change somehow.
If we saw the log happening every 5 seconds (its poll interval) on a given computer, that would be a red flag and force us to reconsider this approach.