Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve hash function for SockAddr #25

Open
aszlig opened this issue Jul 20, 2020 · 0 comments
Open

Improve hash function for SockAddr #25

aszlig opened this issue Jul 20, 2020 · 0 comments

Comments

@aszlig
Copy link
Contributor

aszlig commented Jul 20, 2020

Right now the hash function uses strings to determine hash value. However, this is quite expensive and we can also do better in terms of avoiding collisions.

While the port and address family are the most important parts where we want to avoid collisions, the IP addresses also need to be hashed properly so that ideally we won't get any collision at all.

So here's how IP addresses are assigned starting from the most common cause to the least common:

  • Remote peer determined via SO_PEERCRED:

    • IPv4: Equal to PID (uint32_t)
    • IPv6: 0xfe800000UUUUUUUUGGGGGGGGPPPPPPPP (U -> UID; G -> GID; P -> PID)
  • Implicit binding via connect or sendto/sendmsg:

    • IPv4: Equal to PID (uint32_t)
    • IPv6: 0xfe800000UUUUUUUUGGGGGGGGPPPPPPPP (U -> UID; G -> GID; P -> PID)
  • If address is already a loopback address passed to eg. bind, keep it as is.

  • If it's an implicit binding done via recvfrom or recvmsg, we get local bindings with the following IP addresses:

    • IPv4: 0x00XXXXXX (X -> Random value)
    • IPv6: 0xfe8000000000000000000000XXXXXXXX (X -> Random value)

Since size_t is guaranteed to have at least 16 bits, it's probably safe to fall back to only encode the port if size_t is 16 bit, since we usually have one low port and several ephemeral ports. Also, we're only interested in Unix domain sockets, so distinguishing between IPv4 and IPv6 could even be neglected.

If size_t is 32 bits or more, it's certainly a good idea to encode the IP address as well. Since the PID is always the last part of the address for the two most common cases, it's probably safe to only encode the PID. Similarly even if we already got a loopback address or a random IP address, the last 32 bits are the interesting ones.

However, we still need the port and address family, so we certainly want to trim the IP address portion down to 16 bit if we have exactly 32 bits. Of course, on 64 bit systems, we can even use 48 bits, so collisions are even more unlikely to occur. What I'm not sure about here is which value to use for upper 16 bits of those 48 bits. 0xfe80? Or just a bit flag?

Update: I've removed the first two cases above, because they're not actually used in an unordered_map.

aszlig added a commit that referenced this issue Jul 7, 2021
The LGTM service constantly annoys me with this alert, which in fact
isn't actually a thing that's broken, because the function in question
is just slower than it could be.

This is going to be addressed at some point[1] but it's not something
critical and since "FIXME" is the wrong label anyway, let's get rid of
that annoying alert.

[1]: #25

Signed-off-by: aszlig <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant