JackDoan
Collaborator

Fixes #1490 for real this time

@JackDoan JackDoan added this to the v1.10.0 milestone Oct 10, 2025
@JackDoan JackDoan requested a review from nbrownus October 10, 2025 19:55
@brad-defined
Collaborator

Relays need to do an IP-based lookup during some jumps. In the reporter's case, the CreateRelayRequest's source IP was the host's IPv4 address, and the target was its IPv6 address. In addition, the target machine only had an IPv6 address.

When performing the IP-based lookup during relay queries, sendNoMetrics queries for every known and authenticated IP address associated with the looked-up HostInfo object:

// Try to send via a relay
for _, relayIP := range hostinfo.relayState.CopyRelayIps() {
	relayHostInfo, relay, err := f.hostMap.QueryVpnAddrsRelayFor(hostinfo.vpnAddrs, relayIP) // <- uses hostinfo.vpnAddrs here!!
	if err != nil {
		hostinfo.relayState.DeleteRelay(relayIP)
		hostinfo.logger(f.l).WithField("relay", relayIP).WithError(err).Info("sendNoMetrics failed to find HostInfo")
		continue
	}
	f.SendVia(relayHostInfo, relay, out, nb, fullOut[:header.Len+len(out)], true)
	break
}

If an IPv6-only host attempts to send to a peer who sent a CreateRelayRequest with both an IPv4 and an IPv6 address, this query should attempt to find the relay host info by either IP, and will be satisfied if any lookup works.

Relays ought to work, then, even if a host only has an IPv6 address installed, but received a CreateRelayRequest from an IPv4 address, as long as the handshake certificate included that IPv4 address in it.

I went searching through the handshake code, and found some updates that remove the IPv4 addresses from the handshake'd hostinfo, as the IPv4 addresses do not overlap with the running host's IPv6-only address.

diff --git a/handshake_ix.go b/handshake_ix.go
index 026bfbd..2509a34 100644
--- a/handshake_ix.go
+++ b/handshake_ix.go
@@ -203,9 +203,11 @@ func ixHandshakeStage1(f *Interface, addr netip.AddrPort, via *ViaSender, packet
                }
 
                // vpnAddrs outside our vpn networks are of no use to us, filter them out
-               if !f.myVpnNetworksTable.Contains(vpnAddr) {
-                       continue
-               }
+               /*
+                       if !f.myVpnNetworksTable.Contains(vpnAddr) {
+                               continue
+                       }
+               */
 
                filteredNetworks = append(filteredNetworks, network)
                vpnAddrs = append(vpnAddrs, vpnAddr)
@@ -578,9 +580,11 @@ func ixHandshakeStage2(f *Interface, addr netip.AddrPort, via *ViaSender, hh *Ha
        for _, network := range vpnNetworks {
                // vpnAddrs outside our vpn networks are of no use to us, filter them out
                vpnAddr := network.Addr()
-               if !f.myVpnNetworksTable.Contains(vpnAddr) {
-                       continue
-               }
+               /*
+                       if !f.myVpnNetworksTable.Contains(vpnAddr) {
+                               continue
+                       }
+               */
 
                filteredNetworks = append(filteredNetworks, network)
                vpnAddrs = append(vpnAddrs, vpnAddr)

I'm not sure what these filters accomplish, but removing them makes relays work again with this PR's included e2e test.

@brad-defined
Collaborator

With the approach in this PR, clients must craft CreateRelayRequest messages containing IPs that map to something in the peer's (unknown) certificate vpn networks. That means the migrateRelayUsed() function in connection_manager.go will also need to be updated.

@nbrownus
Collaborator

nbrownus commented Oct 16, 2025

Nebula currently enforces that packets received on the tun device are within the networks it should be serving. The entry point is

nebula/inside.go

Lines 51 to 53 in fa8c013

hostinfo, ready := f.getOrHandshakeConsiderRouting(fwPacket, func(hh *HandshakeHostInfo) {
	hh.cachePacket(f.l, header.Message, 0, packet, f.sendMessageNow, f.cachedPacketMetrics)
})

which ends up here for vpn networks

nebula/inside.go

Lines 128 to 136 in fa8c013

// getOrHandshakeNoRouting returns nil if the vpnAddr is not routable.
// If the 2nd return var is false then the hostinfo is not ready to be used in a tunnel
func (f *Interface) getOrHandshakeNoRouting(vpnAddr netip.Addr, cacheCallback func(*HandshakeHostInfo)) (*HostInfo, bool) {
	if f.myVpnNetworksTable.Contains(vpnAddr) {
		return f.handshakeManager.GetOrHandshake(vpnAddr, cacheCallback)
	}
	return nil, false
}

The table is generated from the configured certificates at boot here

nebula/pki.go

Lines 417 to 430 in fa8c013

for _, network := range crt.Networks() {
	cs.myVpnNetworks = append(cs.myVpnNetworks, network)
	cs.myVpnNetworksTable.Insert(network)
	cs.myVpnAddrs = append(cs.myVpnAddrs, network.Addr())
	cs.myVpnAddrsTable.Insert(netip.PrefixFrom(network.Addr(), network.Addr().BitLen()))
	if network.Addr().Is4() {
		addr := network.Masked().Addr().As4()
		mask := net.CIDRMask(network.Bits(), network.Addr().BitLen())
		binary.BigEndian.PutUint32(addr[:], binary.BigEndian.Uint32(addr[:])|^binary.BigEndian.Uint32(mask))
		cs.myVpnBroadcastAddrsTable.Insert(netip.PrefixFrom(netip.AddrFrom4(addr), network.Addr().BitLen()))
	}
}

Given these constraints, consider the following hosts:

Host A has a vpn network of 10.0.0.1/24
Host B has a vpn network of 192.168.0.1/24

Another example would be:

Host A has a vpn network of 10.0.0.1/24
Host B has a vpn network of fd99::1/64

In both cases there is no common network, so the hosts would never be able to communicate (in most cases). This condition is trapped and handshakes are rejected.

nebula/handshake_ix.go

Lines 214 to 222 in fa8c013

if len(vpnAddrs) == 0 {
	f.l.WithError(err).WithField("udpAddr", addr).
		WithField("certName", certName).
		WithField("certVersion", certVersion).
		WithField("fingerprint", fingerprint).
		WithField("issuer", issuer).
		WithField("handshake", m{"stage": 1, "style": "ix_psk0"}).Error("No usable vpn addresses from host, refusing handshake")
	return
}

A less degenerate case would be:

Host A has a vpn network of 10.0.0.1/24 and fd99::1/64
Host B has a vpn network of fd99::2/64

In this case the two hosts can communicate over the ipv6 addresses but the ipv4 address in Host A is rather pointless.

This is where it starts getting complicated.

In nebula cert v1 we only had a single vpn network, which we used as the primary identifier for a host, and it worked great. In nebula cert v2 we can now have multiple vpn networks and no longer have an obvious primary identifier. To solve that problem we decided to deterministically order the list of vpn networks and select the first entry as the primary identifier. The sort order puts the smallest ipv4 address in the first position.

The trouble comes into play whenever we select that primary identifier and try to do anything on the network with it if it is not within our own vpn networks.

A simple example of this would be a host with both ipv4 and ipv6 vpn networks and a lighthouse with only ipv6. LightHouseHandler.handleHostQuery will respond to the host that queried on its primary vpn addr. If that vpn addr is not within the lighthouse's networks, the packet will be dropped and the query will never result in a response, which means the network doesn't work.

w.SendMessageToVpnAddr(header.LightHouse, 0, fromVpnAddrs[0], lhh.pb[:ln], lhh.nb, lhh.out[:0])

The basis of this change was introduced in #1318 to fix this class of problems, one of which I recall being a relay establishment issue similar to the one we are trying to fix here:

IPv6 -> ipv4/6 relay -> ipv6 host
IPv6 -> ipv4/6 relay -> ipv4/6 host

But it missed this scenario:

IPv4/6 -> ipv4/6 relay -> ipv6 host

In summary, I don't think a "primary" vpn addr is the only way to address this general type of problem, but it looked to be the simplest and most efficient way to approach it. One alternative could be to always calculate the vpn addr in a common vpn network. Another could be what @JackDoan had floated: a sort of mac addressy type of L2 identifier, which would likely necessitate a certificate format change and be another thing to ensure is unique within your mesh. In every case we need to be able to map a vpn ip address to a HostInfo object or a remote ip address to handshake with.


Side note: we have discussed possibly adding a config option (or looking at the root CAs list of vpn networks, or both) to opt into additional vpn networks which would allow for situations like this to possibly work (at least in the ipv4 to ipv4 or ipv6 to ipv6 case).

Development

Successfully merging this pull request may close these issues.

🐛 BUG: Nebula fails to relay ipv6 traffic from a v4/v6 host to a v6 only host