Linux specific improvements (BIG potential speedup) #1687

alexcu2718 opened this issue Mar 5, 2025 · 3 comments

alexcu2718 commented Mar 5, 2025

Hi All,

Links:

https://crates.io/crates/fdf

https://github.com/alexcu2718/fdf

https://github.com/alexcu2718/fdf/tree/main/fd_benchmarks

I've made a rough skeleton copy of fd.

I did this because I'm learning Rust and C, I'm terribly disorganised, and I wanted to combine my efforts into something that uses both.

So I do the natural thing and...make an overly complicated tool to fight it. I'm a genius, you don't need to tell me.

I have replicated about 30-40% of the features; I haven't bothered to recreate the rest.

I'm posting this here to ask whether the maintainers would want me to develop the idea further and contribute it to this project, or whether I should take it into my own project.

Some small issues I haven't touched yet because this is still VERY work-in-progress:

  1. I have not implemented custom errors; it's pretty much Box<dyn Error> style (errors are handled at the wrap-up, not early on).

  2. I believe my parallelism attempts are far from ideal; I think my traversal strategy can be made much more refined.

  3. I do not know how reliable my methodology would be on e.g. btrfs or ext2, because it relies on basic, cheap syscalls.
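One concrete way point 3 can bite: Linux directory reads return a `d_type` byte per entry, but a filesystem is allowed to leave it as `DT_UNKNOWN`. Below is a minimal sketch of a fallback, assuming the traversal works from `d_type`; `Kind` and `resolve_kind` are hypothetical names for illustration, not code from fdf.

```rust
use std::fs;
use std::io;
use std::path::Path;

// Linux `d_type` values from <dirent.h>.
const DT_UNKNOWN: u8 = 0;
const DT_DIR: u8 = 4;
const DT_REG: u8 = 8;
const DT_LNK: u8 = 10;

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Kind {
    Dir,
    File,
    Symlink,
    Other,
}

/// Hypothetical helper: trust `d_type` when the filesystem filled it in,
/// and pay for an lstat-style call only when it reports DT_UNKNOWN.
fn resolve_kind(d_type: u8, path: &Path) -> io::Result<Kind> {
    match d_type {
        DT_DIR => Ok(Kind::Dir),
        DT_REG => Ok(Kind::File),
        DT_LNK => Ok(Kind::Symlink),
        DT_UNKNOWN => {
            // Slow path: one extra syscall for this entry.
            let ft = fs::symlink_metadata(path)?.file_type();
            Ok(if ft.is_dir() {
                Kind::Dir
            } else if ft.is_file() {
                Kind::File
            } else if ft.is_symlink() {
                Kind::Symlink
            } else {
                Kind::Other
            })
        }
        // FIFOs, sockets, character and block devices.
        _ => Ok(Kind::Other),
    }
}

fn main() -> io::Result<()> {
    // Pretend the filesystem did not report a type for the root directory.
    println!("{:?}", resolve_kind(DT_UNKNOWN, Path::new("/"))?);
    Ok(())
}
```

With a fallback like this the fast path stays free on filesystems that fill in `d_type`, and the results stay correct (just slower) on ones that do not.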

Quick rundown of methodology:

I basically remade a read_dir that takes and returns raw bytes. This is handy because I can pass the bytes to regex without any conversion cost (and also recurse without any overhead!).
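To illustrate the raw-bytes idea with the standard library only: on Unix a file name is already a byte string, so it can be matched without UTF-8 validation or a `String` allocation. This is a sketch, not fdf's code; `has_extension` is a hypothetical helper, and it uses `std::fs::read_dir` rather than a custom directory reader. For full patterns, the regex crate's `regex::bytes` module matches on `&[u8]` in the same way.

```rust
use std::os::unix::ffi::OsStrExt;

/// Hypothetical helper: case-sensitive extension check on raw bytes.
fn has_extension(name: &[u8], ext: &[u8]) -> bool {
    // Require at least one byte of stem, then a dot, then the extension.
    name.len() > ext.len() + 1
        && name.ends_with(ext)
        && name[name.len() - ext.len() - 1] == b'.'
}

fn main() -> std::io::Result<()> {
    for entry in std::fs::read_dir(".")? {
        let entry = entry?;
        let name = entry.file_name();
        // as_bytes() is a borrowed view of the OsStr; nothing is copied.
        if has_extension(name.as_bytes(), b"jpg") {
            println!("{}", name.to_string_lossy());
        }
    }
    Ok(())
}
```

Names that are not valid UTF-8 go through the same path, which is one reason to stay in bytes until something has to be printed.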

I've minimised heap allocations, though not enough, I believe; I'm still very new to mixing C and Rust.

By using cheaper syscalls than e.g. fstat, I manage to keep the speed pretty damn good, and I get a lot of metadata for free. Notable exceptions are symlinks and executables, although filtering on these is still faster than fd.
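Executables are a natural exception because permission bits are not part of a directory entry, so some stat-family call is unavoidable for them. The sketch below shows the split using only the standard library; `executables_in` is a hypothetical helper and not fdf's implementation. It relies on the documented behaviour that `DirEntry::file_type` needs no extra syscall on most Unix platforms, while `DirEntry::metadata` does.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::PermissionsExt;
use std::path::{Path, PathBuf};

/// Hypothetical helper: executable regular files directly inside `dir`.
fn executables_in(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut out = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        // Cheap: on Linux this normally reuses the type returned with the entry.
        if !entry.file_type()?.is_file() {
            continue;
        }
        // Expensive: the mode bits need a stat-family call per file.
        let mode = entry.metadata()?.permissions().mode();
        if mode & 0o111 != 0 {
            out.push(entry.path());
        }
    }
    Ok(out)
}

fn main() -> io::Result<()> {
    for path in executables_in(Path::new("."))? {
        println!("{}", path.display());
    }
    Ok(())
}
```

Checking the type first means directories and symlinks never reach the metadata call, so the extra cost is paid only for regular files.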

There's a lot of unsafe code in here, mostly raw pointer casts. I've tested it on recent Arch and Debian installs, where it runs a lot quicker and I have seen no signs of UB.

NOTE:

I HAVE NOT DONE THE 'NO PATTERN' benchmark, as there are some weird bugs where the results don't align. (There are odd issues with either truncation or an extra slash being added. I'm not sure which; given that the rest of the benchmarks are spot on, I'm wondering if it's temporary files or something similar.)

The results in the benchmarks shown here match 100% (and fdf IS MUCH FASTER).

The benchmarks follow (works on my machine TM):

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| fdf -HI '.[0-9].jpg$' '/home/alexc' | 354.1 ± 1.3 | 352.6 | 356.6 | 5.88 ± 0.08 |
| fdf '.[0-9].jpg$' '/home/alexc' | 60.2 ± 0.8 | 59.1 | 63.8 | 1.00 |
| fd -HI '.[0-9].jpg$' '/home/alexc' | 460.0 ± 13.8 | 446.8 | 490.4 | 7.64 ± 0.25 |
| fd '.[0-9].jpg$' '/home/alexc' | 152.2 ± 1.1 | 150.4 | 154.8 | 2.53 ± 0.04 |

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| fdf -HI --extension 'jpg' '' '/home/alexc' | 451.2 ± 2.7 | 447.8 | 456.0 | 1.00 |
| fd -HI --extension 'jpg' '' '/home/alexc' | 669.9 ± 13.0 | 659.1 | 703.1 | 1.48 ± 0.03 |

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| fdf . '/home/alexc' -HI --type l | 489.0 ± 2.2 | 484.6 | 491.7 | 1.00 |
| fd -HI '' '/home/alexc' --type l | 622.2 ± 3.2 | 616.3 | 625.9 | 1.27 ± 0.01 |

I will say that some of the choices I made while developing this are pretty IFFY* performance-wise; mostly I wanted to get the main skeleton working. I'm also aware I might need to totally redesign some aspects. What do you expect from a guy who's been learning for only 4 months because he's sick of his shitty Python/Bash/C# job?

(*Though I think my DirEntry is pretty damn good efficiency-wise!)

So,

Please let me know your thoughts. If you'd like me to do a proper rewrite and would accept the code (if it looked good), I'd be happy to do so.

Thanks,

Alex


alexcu2718 commented Mar 5, 2025

I've added the fixes for the no-pattern case to my repo.

There are still some odd bits around excluding the start directory, and I have no idea why .xonshrc and .bash_logout get excluded.

Copy-pasting the results below:

Summary
fdf '' '/home/alexc' -HI ran
1.54 ± 0.04 times faster than fd '' '/home/alexc' -HI
WARNING: There were differences between the search results of fd and find!
Run 'diff /tmp/results.fd /tmp/results.find'.
the count of files in the results.fd are 2426601
the count of files in the results.find are 2426600
the total difference are 6
❯ diff /tmp/results.fd /tmp/results.find
0a1
> /home/alexc
8d8
< /home/alexc/.bash_logout
2425419d2425418
< /home/alexc/.xonshrc

tmccombs (Collaborator) commented

I'm not really sure what the purpose of this issue is. Do you think there is something from your code that could be applied to fd to make it faster?

alexcu2718 (Author) commented

> I'm not really sure what the purpose of this issue is. Do you think there is something from your code that could be applied to fd to make it faster?

I talked with sharkdp via email, and he told me to file it as an issue.

Essentially, I've put this here as a proof of concept of how to increase the speed. Given that it would be a total rewrite for Linux only (maybe BSD, no idea about macOS), it's probably more effort than it's worth. So I was debating whether the potential speed increase justifies a total rewrite. In my benchmarks it's at least a 1.4x speedup, sometimes up to 3x, so I wondered if that was of interest.

Otherwise I'll probably just make my own utility.
