
Conversation

@jampe (Contributor) commented Mar 19, 2025

This PR builds upon PR #1357, "Incoming Handshake filtering based on firewall rules", and will remain in draft status until that PR is either merged or declined.

This pull request introduces the functionality to send the flattened firewall rules (hostnames, groups, group combos, and CIDRs) to a node's lighthouses. After a node has transmitted this information to its lighthouse, the lighthouse will filter host queries for that node based on the provided whitelist. The node includes these flattened firewall rules in the initial HostUpdateNotification message and resends them if there are any changes to the data.

Config changes:

A new configuration value has been added within the Lighthouse section to enable this feature:

# This setting on a lighthouse determines whether to enforce the host query protection
# whitelist received from a node. On a node, this setting controls whether the node
# sends its handshake filtering whitelist to the lighthouses at all.
#enable_host_query_protection: false

Implementation details:

When sending a HostUpdateNotification message, the function HandshakeFilter:ToHandshakeFilteringWhitelist is called to convert the whitelist from its map format to the appropriate protobuf data structure. Then the function HandshakeFilter:FromHandshakeFilteringWhitelist performs the inverse operation on the lighthouse side. Afterwards the lighthouse stores the HandshakeFilter instance in the node's RemoteList structure by invoking RemoteList:unlockedSetHandshakeFilteringWhitelist().
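
To make the flattening step more concrete, here is a minimal sketch of the map-to-protobuf conversion, using simplified stand-in types; the names (HandshakeFilteringWhitelist, GroupCombo, hosts/groups/groupCombos/cidrs) are illustrative assumptions and not necessarily the exact structures from #1357 or the generated protobuf code:

package example

import "net/netip"

// Illustrative stand-ins for the generated protobuf types.
type GroupCombo struct{ Groups []string }

type HandshakeFilteringWhitelist struct {
	Hosts       []string
	Groups      []string
	GroupCombos []*GroupCombo
	Cidrs       []string
}

// HandshakeFilter mirrors the flattened, map-based whitelist kept by a node.
type HandshakeFilter struct {
	hosts       map[string]struct{}
	groups      map[string]struct{}
	groupCombos [][]string
	cidrs       []netip.Prefix
}

// ToHandshakeFilteringWhitelist flattens the maps into the wire structure
// that rides inside the HostUpdateNotification message.
func (hf *HandshakeFilter) ToHandshakeFilteringWhitelist() *HandshakeFilteringWhitelist {
	w := &HandshakeFilteringWhitelist{}
	for host := range hf.hosts {
		w.Hosts = append(w.Hosts, host)
	}
	for group := range hf.groups {
		w.Groups = append(w.Groups, group)
	}
	for _, combo := range hf.groupCombos {
		w.GroupCombos = append(w.GroupCombos, &GroupCombo{Groups: combo})
	}
	for _, cidr := range hf.cidrs {
		w.Cidrs = append(w.Cidrs, cidr.String())
	}
	return w
}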

Upon receiving host queries, the lighthouse checks for any stored rules associated with the queried node. If rules are present, the whitelist is consulted to determine whether the query should be permitted.
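
Roughly, that query-time check looks like the following sketch; the types are trimmed-down stand-ins (group combos, CIDR and CA matching are omitted for brevity) rather than the actual lighthouse code:

package example

// hostWhitelist is a simplified stand-in for the whitelist a lighthouse
// stores per node.
type hostWhitelist struct {
	hosts  map[string]struct{}
	groups map[string]struct{}
}

// allowed reports whether a querier, identified by its certificate name and
// groups, may receive an answer for the node this whitelist belongs to.
func (w *hostWhitelist) allowed(querierName string, querierGroups []string) bool {
	if w == nil {
		// The queried node never sent a whitelist, so the query is permitted.
		return true
	}
	if _, ok := w.hosts[querierName]; ok {
		return true
	}
	for _, g := range querierGroups {
		if _, ok := w.groups[g]; ok {
			return true
		}
	}
	return false
}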

To accommodate the flattened firewall rules sent to the lighthouse, the HostUpdateNotification message has been extended. The whitelist is sent to the lighthouses only on the initial request, when the whitelist changes locally, or when the tunnel to the lighthouse was rebuilt. This approach helps reduce unnecessary network load for users operating Nebula with many nodes.
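
One simple way to implement the "resend only when something changed" behavior is to fingerprint the encoded whitelist and compare it against the last copy sent; this is just an illustrative sketch, not the exact mechanism used in the PR:

package example

import (
	"bytes"
	"crypto/sha256"
)

// whitelistSender remembers a fingerprint of the last whitelist transmitted
// to the lighthouses.
type whitelistSender struct {
	lastSent []byte
}

// shouldSend reports whether the encoded whitelist differs from what was last
// sent; callers would additionally force a send after a lighthouse tunnel is
// rebuilt.
func (s *whitelistSender) shouldSend(encodedWhitelist []byte) bool {
	sum := sha256.Sum256(encodedWhitelist)
	if bytes.Equal(sum[:], s.lastSent) {
		return false
	}
	s.lastSent = sum[:]
	return true
}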

To identify potentially malicious or infected nodes, a new metric, "lighthouse.hostqueries.filtered", has been introduced to track filtered host queries on a lighthouse.
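
For reference, such a counter could be registered through the rcrowley/go-metrics library that Nebula already uses, along the lines of the sketch below; the surrounding wiring in the PR may differ:

package example

import "github.com/rcrowley/go-metrics"

// filteredQueries counts host queries the lighthouse refused to answer
// because the queried node's whitelist did not match the querier.
var filteredQueries = metrics.GetOrRegisterCounter("lighthouse.hostqueries.filtered", nil)

func recordFilteredQuery() {
	filteredQueries.Inc(1)
}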

@jampe force-pushed the feature_lighthouse_host_query_filtering branch from 864bc8e to f7f3ff3 on March 19, 2025 at 17:53
@JackDoan (Collaborator) commented:

This is a really interesting concept, thanks for putting this together!

If I understand this correctly, it looks like you're deciding whether or not to filter a host query based on if that host's firewall would permit a handshake, which definitely makes sense. I was curious if you had considered an implementation where hosts send their entire firewall to the lighthouse? The lighthouse wouldn't be able to evaluate protocol+port rules, obviously, but in theory, you could allow the lighthouse to filter out queries that it "knows" would be unable to communicate with the queried-for host.

One potential issue that comes to mind with both approaches is the potential size of the HandshakeFilteringWhitelist field. The internet is not very good at delivering fragmented packets, and if there are a lot of rules (or rules with very long group names!), the NebulaMeta packet would potentially be undeliverable. This problem technically already exists for certificates, but I think it's more likely to crop up with firewall rules, since they frequently end up needing to express many different combinations, rather than describe a single host, like a cert does.

@jampe (Contributor Author) commented Mar 20, 2025

If I understand this correctly, it looks like you're deciding whether or not to filter a host query based on if that host's firewall would permit a handshake, which definitely makes sense. I was curious if you had considered an implementation where hosts send their entire firewall to the lighthouse? The lighthouse wouldn't be able to evaluate protocol+port rules, obviously, but in theory, you could allow the lighthouse to filter out queries that it "knows" would be unable to communicate with the queried-for host.

Yes, your description roughly summarizes the functionality implemented by this PR. The foundational PR #1357 establishes filtering at the node level for connections between nodes that either know the Nebula port or can infer it. This PR then extends the filtering capabilities to the lighthouse, implementing the approach you outlined.

Initially, my intention was to transmit the inbound firewall rule structure to the lighthouse. However, upon reviewing the code, I recognized that, given the information available to the lighthouse for filtering, I can omit some data. The HandshakeFilteringWhitelist structure contains deduplicated hosts, groups, combinations of groups (ANDed groups), CIDR blocks, and other relevant entities that the lighthouse can utilize for filtering purposes.

One potential issue that comes to mind with both approaches is the potential size of the HandshakeFilteringWhitelist field. The internet is not very good at delivering fragmented packets, and if there are a lot of rules (or rules with very long group names!), the NebulaMeta packet would potentially be undeliverable. This problem technically already exists for certificates, but I think it's more likely to crop up with firewall rules, since they frequently end up needing to express many different combinations, rather than describe a single host, like a cert does.

Good input! While I considered strategies to minimize network and CPU load on both nodes and lighthouses - such as transmitting data only in the initial message or when local firewall rules are modified - I had not accounted for the issue of packet fragmentation.

A solution could be to create a dedicated message type. Since the user controls the MTU, I could manage packet sizes accordingly and split the data into multiple packets as necessary. Since group names are likely to be repeated across firewall rules, compression could be effective in further reducing the size of the transmitted data. What do you think?
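
As a rough illustration of the splitting idea (the dedicated message type, ack handling, and compression are left out, and the 100-byte header allowance is a placeholder rather than Nebula's real overhead):

package example

// splitIntoChunks breaks a serialized whitelist into pieces small enough to
// fit one UDP packet each, given the configured tun MTU.
func splitIntoChunks(payload []byte, mtu int) [][]byte {
	maxChunk := mtu - 100
	if maxChunk <= 0 {
		maxChunk = 1
	}
	var chunks [][]byte
	for len(payload) > maxChunk {
		chunks = append(chunks, payload[:maxChunk])
		payload = payload[maxChunk:]
	}
	return append(chunks, payload)
}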

@nbrownus (Collaborator) commented:

@jampe can you describe the situation that led you to a client authoritative approach for this?

@jampe (Contributor Author) commented Mar 28, 2025

@jampe can you describe the situation that led you to a client authoritative approach for this?

I have given this considerable thought and decided on the client authoritative approach based on the following arguments:

Arguments for Client Authority

  • Simplicity of Configuration: Maintaining a straightforward configuration without duplicates helps prevent configuration failures.
  • Scalability: There is no need to redeploy or reload Lighthouse configurations when a client is added or changed within the network. In my opinion, this is a significant advantage of Nebula's existing design, and something I'd like to preserve.
  • Lighthouses only retain information about currently online nodes. There's no data about offline or outdated nodes on a lighthouse.

Arguments for Lighthouse Authority

  • Enforcement of Access Policies: Lighthouses can enforce access policies on clients, which may help prevent data leaks if a client maliciously attempts to modify its firewall configuration.
  • Simpler Implementation: This approach requires fewer code modifications, resulting in a smaller overall change.
  • There is no additional network load on the Lighthouses.

Neutral Considerations

  • Malicious clients could potentially block connections to themselves at the Lighthouse level when using the client-side approach. However, if a malicious actor can manipulate the local Nebula configuration, they would already have the capability to block access through firewall rule modifications anyway.

In my view, the arguments in favor of client authority outweigh those supporting Lighthouse authority. I have aimed to minimize the amount of code touched and to keep the implementation as simple as possible. Additionally, the extra network load is kept to a minimum by sending the rules to the Lighthouse only during the initial announcement or when the local firewall rules have changed.

jampe added 4 commits March 30, 2025 12:57
- send HostQueryWhitelist when NebulaMeta_HostUpdateNotificationAck is handled
- support multiple HostQueryWhitelist messages based on node mtu
- handle lost udp packet using NebulaMeta_HostQueryWhitelistAck message
- improved testing
}
}

if initial || c.HasChanged("tun.routes") {
@jampe (Contributor Author) commented on this diff:

I don't really like parsing tun.routes here either, but I didn't find a better way to get access to the MTU data. Do you know a better way?

@nbrownus (Collaborator) commented Apr 1, 2025

@jampe I was looking for more information regarding your mesh network and what properties it had that led you to choose the client authoritative model.

For example, a lighthouse authoritative model would fit more cleanly with networks that have someone with administrative permissions over the entire network, think a company wide ops team or running your own home lab.

@jampe (Contributor Author) commented Apr 13, 2025

@jampe I was looking for more information regarding your mesh network and what properties it had that led you to choose the client authoritative model.

Well, we looked at Nebula during a red team assignment for a customer, where we discovered that we could leak internal IPs after gaining access to an infected, external client. We had some spare time, so the customer allowed us to submit a PR attempting to fix the problem.

They use an Ansible role to manage and create all the configs, so it would basically be possible for them to manage the rules on the lighthouse as well. I just looked at the code and thought a client authoritative model is more "Nebula style", as the lighthouse has zero information about its clients when starting up and learns all the data when they announce themselves (e.g. the IPs or the relay servers to use with a client).

@jampe (Contributor Author) commented May 12, 2025

I thought about how this would change if we were to implement it on the lighthouse side. The implementation would be relatively straightforward, touching way less code. To maintain the same level of control as the current implementation, I would propose adding a new section to the lighthouse configuration, as follows:

lighthouse:
  # This option governs which nodes are permitted to query specific hosts (or groups of hosts)
  # within the network. You can specify a single host, a comma-separated list of hosts, a range of hosts, or a subnet as the key, along with
  # the corresponding rules that determine which nodes are authorized to perform host queries as the value.
  # Nodes that do not conform to the defined rules but still attempt to make a query will receive
  # no response. If no rules are defined, all queries will be permitted.
  hostquery_filtering:
    10.0.10.10:
      hosts:
        - 10.0.10.100
      group:
        - europe
        - webservers
    10.0.10.20,10.0.10.25,10.0.10.29:
      # This ANDs the groups: a node has to be in each group for the query to be allowed.
      group:
        - - webservers
          - apache
        - - webservers
          - nginx
          - reverseproxy
    10.0.10.30-10.0.10.35:
      cidrs:
        - 10.0.10.0/24
    10.0.10.40:
      ca_names:
        - myCA
      ca_shas:
        - xxxxxx
    10.0.10.128/26,10.0.10.192/28:
      group:
        - europe
        - datacenter-a
      ca_shas:
        - xxxxxx

With something like this, the only remaining task would be to code the parsing logic, as the filtering logic could be utilized largely as it is.
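
To illustrate, parsing that section could look roughly like the sketch below, which takes the already-decoded YAML as a generic map; all names here are hypothetical, and the real implementation would go through Nebula's config package instead:

package example

// queryRule mirrors one value of the proposed hostquery_filtering map.
type queryRule struct {
	Hosts       []string
	GroupCombos [][]string // each inner slice is an AND'ed set of groups
	Cidrs       []string
	CaNames     []string
	CaShas      []string
}

// toStringSlice converts a decoded YAML list into []string, skipping anything
// that is not a string.
func toStringSlice(v any) []string {
	raw, _ := v.([]any)
	out := make([]string, 0, len(raw))
	for _, item := range raw {
		if s, ok := item.(string); ok {
			out = append(out, s)
		}
	}
	return out
}

// parseHostQueryFiltering turns the hostquery_filtering section into a map
// keyed by the target host, range, or subnet expression.
func parseHostQueryFiltering(section map[string]any) map[string]queryRule {
	rules := make(map[string]queryRule, len(section))
	for target, v := range section {
		entry, ok := v.(map[string]any)
		if !ok {
			continue
		}
		r := queryRule{
			Hosts:   toStringSlice(entry["hosts"]),
			Cidrs:   toStringSlice(entry["cidrs"]),
			CaNames: toStringSlice(entry["ca_names"]),
			CaShas:  toStringSlice(entry["ca_shas"]),
		}
		// "group" entries are either plain strings (any one group suffices)
		// or nested lists (all groups in the inner list are required).
		if groups, ok := entry["group"].([]any); ok {
			for _, g := range groups {
				switch g := g.(type) {
				case string:
					r.GroupCombos = append(r.GroupCombos, []string{g})
				case []any:
					r.GroupCombos = append(r.GroupCombos, toStringSlice(g))
				}
			}
		}
		rules[target] = r
	}
	return rules
}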

I still think the dynamic, client-side approach is more flexible; however, I definitely see the argument for central management and strict rule enforcement via lighthouse rules as well. Additionally, by adding and editing far less code, we would avoid introducing additional complexity.

I'd be happy to help implement whatever solution or approach you guys prefer.
