device: add BenchmarkAllowedIPsInsertRemove #36

Open: wants to merge 1 commit into master
Conversation

bradfitz (Contributor)

To show that RemoveByPeer is slow. Currently:

(pprof) top
Showing nodes accounting for 2.99s, 96.14% of 3.11s total
Dropped 35 nodes (cum <= 0.02s)
Showing top 10 nodes out of 36
      flat  flat%   sum%        cum   cum%
     2.72s 87.46% 87.46%      2.72s 87.46%  golang.zx2c4.com/wireguard/device.(*trieEntry).removeByPeer
     0.10s  3.22% 90.68%      0.10s  3.22%  runtime.memclrNoHeapPointers
     0.05s  1.61% 92.28%      0.06s  1.93%  runtime.scanobject
     0.03s  0.96% 93.25%      0.05s  1.61%  runtime.casgstatus
     0.02s  0.64% 93.89%      0.02s  0.64%  runtime.(*gcBitsArena).tryAlloc (inline)
     0.02s  0.64% 94.53%      0.02s  0.64%  runtime.heapBitsSetType
     0.02s  0.64% 95.18%      0.04s  1.29%  runtime.sweepone
     0.01s  0.32% 95.50%      0.02s  0.64%  golang.zx2c4.com/wireguard/device.commonBits
     0.01s  0.32% 95.82%      0.03s  0.96%  runtime.(*mheap).allocSpan
     0.01s  0.32% 96.14%      0.24s  7.72%  runtime.mallocgc

Signed-off-by: Brad Fitzpatrick [email protected]

/cc @zx2c4 @crawshaw @danderson

zx2c4 (Member) commented Jul 14, 2020

Same issue in the kernel code. That's a hard traversal to speed up without increasing the size of each node beyond a cacheline and therefore making lookups slow. Any suggestions?

bradfitz (Contributor, Author)

At least in our case (and perhaps with others?), the overwhelming majority of routes are complete IPv4 or IPv6 addresses (cidr /32 or /128). I was planning on adding a Go map alongside the trie and using both: map for complete addresses and trie for prefixes. That does mean some lookups (for non-complete addresses) need to consult both. I'm fine with that if it means reducing the removeByPeer cost, which is eating 40% of our CPU on our big shared test node accessible to all users.

zx2c4 (Member) commented Jul 14, 2020

Instead of trying to add special cases -- whose complexity I wouldn't be so happy about having here -- what about implementing better/faster algorithms for the general case? Specifically, check out https://github.com/openbsd/src/blob/master/sys/net/art.c https://github.com/openbsd/src/blob/master/sys/net/art.h I would very very gladly take an implementation of this directly into wireguard-go (and would prefer it there instead of in a separate repo).
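For a flavor of the scheme in the linked art.c, here is a minimal single-level sketch (4-bit stride, invented `route` type): each prefix gets a "base index" in a binary-heap-shaped array, `allot()` pushes a route down onto fringe entries that still carry the value being overwritten, and a lookup is a single array read. Real ART tables chain multiple such levels; this toy is only meant to show the core indexing trick.

```go
package main

import "fmt"

// Single-level ART sketch. Stride width w = 4 bits, so the heap-shaped
// array has 2^(w+1) entries; entry 1 is the /0 slot and entries
// 2^w..2^(w+1)-1 are the fringe (one per address).
const w = 4

type route struct{ name string }

var x [1 << (w + 1)]*route

// baseIndex maps (addr, prefixLen) to its slot in the heap array.
func baseIndex(addr, plen uint) uint {
	return (addr >> (w - plen)) | (1 << plen)
}

// allot overwrites old with new in the subtree rooted at b, stopping
// wherever a more specific route has already claimed an entry.
func allot(b uint, old, new *route) {
	if x[b] != old {
		return
	}
	x[b] = new
	if b < 1<<w { // internal node: recurse into both children
		allot(b<<1, old, new)
		allot(b<<1|1, old, new)
	}
}

func insert(addr, plen uint, r *route) {
	b := baseIndex(addr, plen)
	allot(b, x[b], r)
}

// lookup is one array read at the fringe entry for addr.
func lookup(addr uint) *route { return x[(1<<w)|addr] }

func main() {
	insert(0b0000, 0, &route{"default"})
	insert(0b1000, 1, &route{"8/1"}) // covers the upper half
	fmt.Println(lookup(0b1100).name) // 8/1
	fmt.Println(lookup(0b0100).name) // default
}
```

The appeal for RemoveByPeer-style operations is that replacing a route is the same `allot` walk as inserting one, bounded by the subtree of its base index rather than the whole structure.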

bradfitz (Contributor, Author)

Oh, nice, I hadn't seen that. PDF from the comments there: http://www.hariguchi.org/art/art.pdf

zx2c4 (Member) commented Jul 14, 2020

Right. Basically it sounds like what happened is that somebody submitted a paper for a new routing table data structure, Knuth reviewed it, and during the review he thought of something better. That's ART.

LC-tries are also pretty fast, but not very fun to implement, and ART may well outperform them.

Weidong Wu has a great book called "Packet Forwarding Technologies" that compares a lot of these different structures, but the latest edition I've found is from 2007, which unfortunately doesn't cover ART. However, the combination of versatility, code compactness, and simplicity makes me prefer ART over the other ones I've implemented in toys.

crawshaw (Collaborator) left a comment

Benchmark LGTM

(The ART data structure is nice.)

a.RemoveByPeer(peers[(i+num/2)%num])
}

// Finally, some stats & validity checks.
Collaborator review comment:

This work at the end is getting added to your total benchmark time and making your numbers fuzzier. Does calling b.StopTimer() just before this step work?
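The reviewer's suggestion looks roughly like this. The benchmark body below is a stand-in, not the PR's actual code; `testing.Benchmark` is only used here so the sketch runs standalone.

```go
package main

import (
	"fmt"
	"testing"
)

// Sketch: stop the benchmark timer before trailing stats/validity
// checks so they aren't counted in ns/op. The summation is a stand-in
// for the real measured work.
func main() {
	r := testing.Benchmark(func(b *testing.B) {
		total := 0
		for i := 0; i < b.N; i++ {
			total += 1 // the work being measured
		}
		b.StopTimer() // everything after this is excluded from timing
		// Finally, some stats & validity checks (not timed).
		if total != b.N {
			b.Fatal("unexpected total")
		}
	})
	fmt.Println(r.N > 0)
}
```

Prints `true`: the benchmark ran at least one iteration, and the post-loop validation contributed nothing to the reported time.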

rand.Seed(1)
rand.Shuffle(num, func(i, j int) { ips[i], ips[j] = ips[j], ips[i] })

// Then repeatedly add one and remove one that was insert 32k inserts back.
Collaborator review comment:

s/insert /inserted /
