
perf record with --call-graph=fp --freq=max #1969

Open
wants to merge 1 commit into master

Conversation

saethlin (Member)

I've been carrying this patch locally for months/years; @Noratrieb asked me to make this PR.

I only find profile_local perf-record useful with this patch applied; otherwise the profile doesn't contain enough samples to draw any conclusions from the data. And since we're sampling as fast as possible to get a reasonable signal-to-noise ratio on a microbenchmark, we need to use frame pointers, which are now enabled by default in the compiler profile (and are enabled in the distributed standard library too!).
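For reference, the recording this PR enables can be sketched like this. The helper name and argument layout below are illustrative, not rustc-perf's actual code:

```rust
use std::process::Command;

/// Build a `perf record` invocation with frame-pointer call graphs.
/// Hypothetical helper for illustration; not rustc-perf's real code.
fn perf_record_cmd(freq: &str, benchmark: &[&str]) -> Command {
    let mut cmd = Command::new("perf");
    cmd.arg("record")
        // Frame-pointer unwinding: cheap per sample, so it tolerates high rates.
        .arg("--call-graph=fp")
        // "max" asks the kernel for the highest rate it allows
        // (see /proc/sys/kernel/perf_event_max_sample_rate).
        .arg(format!("--freq={freq}"))
        .arg("--");
    cmd.args(benchmark);
    cmd
}

fn main() {
    let cmd = perf_record_cmd("max", &["rustc", "main.rs"]);
    let args: Vec<String> = cmd
        .get_args()
        .map(|a| a.to_string_lossy().into_owned())
        .collect();
    println!("perf {}", args.join(" "));
}
```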

Kobzol (Contributor) commented Aug 24, 2024

Interesting! I wonder if dwarf produces better output than fp if you have debuginfo enabled. I usually enable debuginfo when profiling the compiler, but that of course doesn't help when profiling the distributed optimized artifacts :) I'm just wondering whether it would make sense to make this configurable, or somehow detect if the compiler has debuginfo (but that sounds way overkill).

Btw, how much RAM do you have? 😆 I tried to generate a profile for cargo/Full/Debug with --freq=max, and doing perf report on the results OOMs with 32 GiB of RAM (not all of it is available, though). I'm not sure why; the perf.data result is only about half a gig, which doesn't sound that bad. With --freq=997 (which is what I normally use), it's only 7 MiB on disk and perf report works. Maybe we could find some compromise that would give the recording a high sampling rate while still being usable, as max seems like it might be too much for some benchmarks.

Noratrieb (Member)

dwarf is more precise than fp (it knows about inlined functions), but it produces a lot more data and is therefore more keen to just break. It also requires the lower frequency, which in turn makes it less precise. fp generally works better and faster, but can't know about inlined functions.
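For concreteness, the two unwinding modes differ in the --call-graph argument passed to perf (the dwarf variant optionally takes a per-sample user-stack dump size; the frequencies below illustrate the trade-off, they're not a recommendation):

```rust
fn main() {
    // fp: the kernel walks the frame-pointer chain at sample time,
    // so each sample is tiny and a very high frequency is sustainable.
    let fp = ["record", "--call-graph=fp", "--freq=max"];

    // dwarf: perf copies a chunk of the user stack with every sample
    // (8192 bytes is perf's default dump size) and unwinds it offline,
    // so the data volume pushes you toward a much lower frequency.
    let dwarf = ["record", "--call-graph=dwarf,8192", "--freq=997"];

    println!("perf {}", fp.join(" "));
    println!("perf {}", dwarf.join(" "));
}
```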

saethlin (Member, Author) commented Aug 24, 2024

> I usually enable debuginfo when profiling the compiler, but that of course doesn't help when profiling the distributed optimized artifacts :)

I always set debuginfo-level = 1 because it doesn't turn off optimizations. Variable-level debuginfo is pretty much just a waste anyway with optimizations enabled.
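(For reference, that corresponds to the following in rustc's bootstrap config.toml; level 1 emits line tables only, which is enough to symbolize profiles without the size and compile-time cost of full variable-level debuginfo:)

```toml
# rustc bootstrap config.toml: line tables only, no variable-level debuginfo
[rust]
debuginfo-level = 1
```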

> I tried to generate a profile for cargo/Full/Debug with --freq=max and doing perf report on the results OOMs with 32 GiB of RAM (not all of it is available though).

That's unfortunate. On my system this recording peaks at 19.6 GB memory usage; 20x blowup for in-memory data structures is pretty typical in my experience. My system has 128 GB of memory but that's not really relevant because the perf-report UI becomes unusably slow when you load this much data into it.

Granted, I'd never use this for profiling primary benchmarks. We should probably set a lower frequency for them.


> and is therefore more keen to just break.

Yup. About half the time I reach for it, perf with dwarf callgraphs is completely unusable due to bugs in perf. It crashes when loading its recordings, with a variety of errors depending on your kernel/perf version. Occasionally it just segfaults.

Kobzol (Contributor) commented Aug 24, 2024

Makes sense. So I would suggest this:

  1. Switch the frame resolving mode from dwarf to fp
  2. Make the default freq 997 (or something like that), so that it remains reasonably possible to run the profiler on all of our benchmarks, as was possible before.
  3. Make it possible to override the frequency through a CLI flag (or just an env variable, which would be simpler to thread through rustc-perf, although a bit opaque), so that you can set a higher frequency for smaller benchmarks.

I can do 3) in a follow-up PR if you don't want to deal with it.
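The env-variable route from suggestion 3 could look roughly like this (PERF_RECORD_FREQ and the helper are hypothetical names for illustration, not existing rustc-perf code):

```rust
use std::env;

/// Sampling frequency to pass as `perf record --freq=...`.
/// Falls back to 997 Hz (a prime, so sampling doesn't run in lockstep
/// with periodic program activity) when the hypothetical
/// PERF_RECORD_FREQ variable is unset or invalid.
fn perf_record_freq() -> String {
    match env::var("PERF_RECORD_FREQ") {
        // Accept "max" or any positive integer verbatim.
        Ok(v) if v == "max" || v.parse::<u32>().map_or(false, |n| n > 0) => v,
        _ => "997".to_string(),
    }
}

fn main() {
    println!("--freq={}", perf_record_freq());
}
```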

Btw, what do you use to postprocess/analyze the perf.data file (apart from perf report)?
