-
Notifications
You must be signed in to change notification settings - Fork 232
feat: runtime switching for godirwalk and filepath.WalkDir and benchmark. #215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Here is the benchmark code I ran. |
|
will look at it on Mon; now running github workflows: lint and integration tests |
|
lint already failed, take a look. Also, maybe rebase on top make lintMust pass locally. |
* https://github.com/NVIDIA/aistore/security/dependabot/44 * #215 Signed-off-by: Alex Aizman <[email protected]>
|
running more tests now; separately, note: |
|
There are two test fails: Secondly, benchmark must be part of the commit - goes under I slightly modified it as follows: @@ -68,12 +68,19 @@
return dir
}
+const (
+ numFilesDflt = 500_000 // NOTE: 50K files is way too small
+ scratchDftl = 64 * 1024
+ depthDflt = 16
+)
+
// BenchmarkGodirwalkLargeDir benchmarks godirwalk on a flat directory.
func BenchmarkGodirwalkLargeDir(b *testing.B) {
- dir := setupLargeDir(b, 50000)
+ dir := setupLargeDir(b, numFilesDflt)
defer os.RemoveAll(dir)
- scratch := make([]byte, 32*1024)
+ scratch := make([]byte, scratchDftl)
+ b.ResetTimer()
var count int64
for i := 0; i < b.N; i++ {
err := godirwalk.Walk(dir, &godirwalk.Options{
@@ -94,9 +101,10 @@
...Results on my machine: $ go test -bench=. -benchtime=10s -benchmem
goos: linux
goarch: amd64
pkg: github.com/NVIDIA/aistore/tmp
cpu: 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz
BenchmarkGodirwalkLargeDir-16 20 509249643 ns/op 69046790 B/op 1550438 allocs/op
BenchmarkWalkDirLargeDir-16 19 541173847 ns/op 99095748 B/op 1553091 allocs/op
BenchmarkGodirwalkDeepTree-16 25 407719713 ns/op 102886230 B/op 1540834 allocs/op
BenchmarkWalkDirDeepTree-16 25 406888381 ns/op 132332444 B/op 1540819 allocs/op
PASS
ok github.com/NVIDIA/aistore/tmp 268.725sThere are a couple observations but most importantly, godirwarlk still maintains a fairly significant advantage memory-wise. |
|
finally, I'd recommend a bench that runs on multiple cores, as in:
More realistic, given that we never walk filesystems in a serial single-threaded way. |
VirrageS
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And as @alex-aizman mentioned we have to have benchmarks as part of the commit.
fs/walkbck.go
Outdated
| } | ||
|
|
||
| func (j *joggerBck) cb(fqn string, de DirEntry) error { | ||
| func (j *joggerBck) cb(fqn string, de fs.DirEntry, err error) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we check if err != nil in this function?
fs/walk.go
Outdated
| import ( | ||
| "context" | ||
| iofs "io/fs" | ||
| "io/fs" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use iofs "io/fs" across the board
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
@alex-aizman @VirrageS Added benchmark code. Do let me know if there is anymore correction needed. The test failure seems to be not related to code. |
|
Benchmark on my macine. goos: linux
goarch: amd64
pkg: github.com/NVIDIA/aistore/bench/walk
cpu: 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
BenchmarkGodirwalkLargeDir-8 82 437096095 ns/op 67334033 B/op 1512335 allocs/op
BenchmarkWalkDirLargeDir-8 90 421606610 ns/op 97212893 B/op 1511242 allocs/op
BenchmarkGodirwalkDeepTree-8 132 249528975 ns/op 101420901 B/op 1508148 allocs/op
BenchmarkWalkDirDeepTree-8 129 273513059 ns/op 130853701 B/op 1508303 allocs/op
PASS
ok github.com/NVIDIA/aistore/bench/walk 821.341sAs godirwalk is more memory efficient do you want to keep godirwalk as it is now ? |
|
started now github workflows on your last commit
That is certainly the main question as godirwalk maintains a 15% and 30% edge over the standard library. Which was not surprising 3 years ago but it is, frankly, somewhat of a surprise today... I don't know. How about a build tag? What do you think? Ideas welcome. |
|
Sorry for the delay. Got caught up with something else. Will be fixing this asap.
Actually I started off this seeing so many other repos migrating to std lib WalkDir referencing the blog which benchmarked with linux kernel as test data. The godirwalk maintainer also stopped maintaining it linking the reference to that blog.
That would be a good idea to consider but I also came to know about this lib fastwalk which is a parallel version of filepath.WalkDir. According to fastwalk benchmark they put on their repo filepath.WalkDir has significant perf benifit to godirwalk But will have look into their benchmark code carefully to confirm. Do you think there were any discrepancies in our benchmark ? |
|
Never heard about it. Quote:
Well, that's quite a claim to make. Unclear though why do they not appear to be very popular... Anyway, back to this PR. Other than the benchmark (a solid piece of code) most of the changes look to me quite cosmetic: --- a/ais/tgttxn.go
+++ b/ais/tgttxn.go
@@ -7,6 +7,7 @@ package ais
import (
"errors"
"fmt"
+ iofs "io/fs"
"net/http"
"net/url"
"os"
@@ -1001,7 +1002,7 @@ func prmScan(dirFQN string, prmMsg *apc.PromoteArgs) (fqns []string, totalN int,
cksum *cos.CksumHash
autoDetect = !prmMsg.SrcIsNotFshare || !cmn.Rom.Features().IsSet(feat.DontAutoDetectFshare)
)
- cb := func(fqn string, de fs.DirEntry) (err error) {
+ cb := func(fqn string, de iofs.DirEntry, _ error) (err error) {
if de.IsDir() {
return
}
@@ -1021,7 +1022,7 @@ func prmScan(dirFQN string, prmMsg *apc.PromoteArgs) (fqns []string, totalN int,
cksum = cos.NewCksumHash(cos.ChecksumCesXxh)
}
if prmMsg.Recursive {
- opts := &fs.WalkOpts{Dir: dirFQN, Callback: cb, Sorted: true}
+ opts := &fs.WalkOpts{Dir: dirFQN, Callback: cb}
...I wonder if it'd be possible to absorb it into some sort of runtime polymorphism - at runtime choice of one possible As far as Finally, there's a different idea of what do we REALLY need from this type "walking" functionality. Hint: think NVMe drives. But - one step at a time. Let's get to some closure with this PR. PS. Please read and accept CONTRIBUTING.md (sign-off == acceptance) |
|
I dont recommend using fastwalk as its fairly new and nobody seems to be using it atleast for now. I was just hinting at their benchmark numbers and std lib WalkDir was performing much better than godirwalk in their benchmark.
Build tag would be an easier integration I think. What do you think ? |
|
sure, let's do build tag. There are pros and cons, we'll see how it looks |
|
Hey I went with runtime switching. No rebuild needed easier switching and extensible. Do let me know if there is any correction needed or if you want to do build tag. Thanks! |
|
I'll check later. Meantime, started all github workflows: lint, et al. |
fs/walk.go
Outdated
| } | ||
| if cmn.IsErrObjLevel(err) { | ||
| ew.counter.Inc() | ||
| return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In godirwalk we skip the entire directory by returning godirwalk.SkipNode. Should we do a similar thing for a standardWalk? We can return filepath.SkipDir: https://pkg.go.dev/path/filepath?utm_source=godoc#SkipDir
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think nil fits better here. As per the WalkDirFunc docs (https://pkg.go.dev/io/fs#WalkDirFunc), SkipDir skips the current directory if d.IsDir() is true, but for files, it skips the parent directory. Returning nil skips just the node (file or directory) and continues the walk, which aligns with godirwalk.SkipNode’s behavior (https://pkg.go.dev/github.com/karrick/godirwalk#SkipNode).
|
I am ready to merge but I need you to sign-off as previously mentioned The following is optional (but still desirable), so let me know.
var useWalker walker = &godir{}
func UseGodirWalker() { useWalker = &godir{} }
func UseStdlibWalker() { useWalker = &stdlib{} }None of it it is actually needed. Use PS. Maybe in a separate commit. |
I realised it was GPG signed and wont show up on commit message after the CI was ran so thought to do it handle while merging. |
No, that's See https://github.com/NVIDIA/aistore/blob/main/CONTRIBUTING.md#signing-off-commits |
|
Hey sorry I meant I did GPG signing (-S) instead of the other one (-s). |
- Add runtime switching with default as godirwalk for better memory efficiency. - Benchmark godirwalk vs filepath.WalkDir. Signed-off-by: Gofastasf <[email protected]>
|
Hey I think it's ready to get merged now! |
|
@Gofastasf thank you so much. Incredible work and perseverance. |
|
@alex-aizman Hey thank you so much. Means a lot!! |

Related to #212