perf-cpp embeds Linux's hardware performance monitoring directly into your code, letting you profile exactly what matters and process the results in your application. Tools like Linux Perf, Intel® VTune™, and AMD uProf are powerful but monitor entire programs – and high-performance applications need surgical precision.
Built around Linux's powerful perf subsystem, perf-cpp provides a clean interface for counting and sampling hardware events – without the complexity of low-level APIs.
- Measure exactly what you want – utilize performance counters to count hardware events, similar to perf stat, but around specific code paths, not an entire binary (documentation).
- Calculate metrics such as cycles per instruction and cache miss to access ratio based on hardware events and timing (documentation).
- Low-latency access to performance counters without starting or stopping them, for micro-benchmarks or adaptive tuning (documentation).
- Record instruction and memory samples, just like perf [mem] record – but from inside your application (documentation).
- Correlate samples with data structures and symbols to generate per-class access statistics and flame graphs.
- Mix built-in events (e.g., cycles, instructions, cache misses, ...) with processor-specific counters (documentation).
See various practical examples and the documentation for more details.
Recording hardware event statistics operates much like perf stat: it quantifies critical events (such as executed instructions, CPU cycles, and cache misses) throughout a code segment's execution.
#include <iostream>
#include <perfcpp/event_counter.h>
/// Initialize the counter
const auto counter_definition = perf::CounterDefinition{};
auto event_counter = perf::EventCounter{ counter_definition };
/// Specify hardware events to count
event_counter.add({"seconds", "instructions", "cycles", "cache-misses"});
/// Run the workload
event_counter.start();
code_to_profile(); /// <-- Statistics are recorded during execution
event_counter.stop();
/// Print the result to the console
const auto result = event_counter.result();
for (const auto [event_name, value] : result)
{
std::cout << event_name << ": " << value << std::endl;
}
Possible output:
seconds: 0.0955897
instructions: 5.92087e+07
cycles: 4.70254e+08
cache-misses: 1.35633e+07
Note
For additional insights please refer to the guides on recording event statistics and event statistics on multiple CPUs/threads. Also, check out the hardware events documentation for details on both built-in and processor-specific events.
Recording samples functions much like perf [mem] record: it captures execution snapshots, e.g., the instruction pointer, executing CPU, and timestamp, at regular intervals (here every 50,000th CPU cycle).
#include <iostream>
#include <perfcpp/sampler.h>
/// Create the sampler
const auto counter_definition = perf::CounterDefinition{};
auto sampler = perf::Sampler{ counter_definition };
/// Specify when a sample is recorded: every 50,000th cycle
sampler.trigger("cycles", perf::Period{50000U});
/// Specify what data is included in a sample: time, CPU ID, instruction pointer
sampler.values()
.timestamp(true)
.cpu_id(true)
.instruction_pointer(true);
/// Run the workload
sampler.start();
code_to_profile(); /// <-- Samples are recorded during execution
sampler.stop();
/// Print the samples to the console
const auto samples = sampler.result();
for (const auto& record : samples)
{
const auto timestamp = record.metadata().timestamp().value();
const auto cpu_id = record.metadata().cpu_id().value();
const auto instruction = record.instruction_execution().logical_instruction_pointer().value();
std::cout
<< "Time = " << timestamp << " | CPU = " << cpu_id
<< " | Instruction = 0x" << std::hex << instruction << std::dec
<< std::endl;
}
Possible output:
Time = 365449130714033 | CPU = 8 | Instruction = 0x5a6e84b2075c
Time = 365449130913157 | CPU = 8 | Instruction = 0x64af7417c75c
Time = 365449131112591 | CPU = 8 | Instruction = 0x5a6e84b2075c
Time = 365449131312005 | CPU = 8 | Instruction = 0x64af7417c75c
Note
For additional details–such as the types of data that can be included in samples–please consult the sampling guide. Additionally, consult the sampling on multiple CPUs/threads guide for instructions on parallel sampling.
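Once samples are available inside the application, a common next step is to aggregate them, for example to see which instruction pointers were hit most often. The following is a minimal sketch of such a histogram. It is independent of perf-cpp: `SampleRecord` and `hottest_instructions` are hypothetical names, and the struct is assumed to be filled from the values printed in the loop above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

/// Hypothetical plain copy of the values extracted from one sample.
struct SampleRecord
{
  std::uint64_t timestamp;
  std::uint32_t cpu_id;
  std::uint64_t instruction_pointer;
};

/// Counts samples per instruction pointer and returns the pairs
/// (instruction pointer, hits), most frequently sampled first.
/// Ties are ordered by ascending address to keep the result stable.
std::vector<std::pair<std::uint64_t, std::size_t>>
hottest_instructions(const std::vector<SampleRecord>& samples)
{
  std::unordered_map<std::uint64_t, std::size_t> hits;
  for (const auto& sample : samples)
  {
    ++hits[sample.instruction_pointer];
  }

  std::vector<std::pair<std::uint64_t, std::size_t>> ranked(hits.begin(), hits.end());
  std::sort(ranked.begin(), ranked.end(), [](const auto& left, const auto& right) {
    return left.second != right.second ? left.second > right.second
                                       : left.first < right.first;
  });
  return ranked;
}
```

For the four samples in the possible output above, this would report two distinct instruction pointers with two hits each.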
We include a collection of examples demonstrating the functionality and interface of perf-cpp in the examples/ directory, including
- examples for counting hardware events (examples/statistics)
- and for sampling (examples/sampling).
perf-cpp is designed as a library (static or shared) that can be linked to your application.
# Clone the repository
git clone https://github.com/jmuehlig/perf-cpp.git
# Switch to the repository folder
cd perf-cpp
# Optional: Switch to this development version
git checkout v0.12.0
# Build the library (in build/)
# -DBUILD_EXAMPLES=1 compiles all examples (optional)
# -DBUILD_LIB_SHARED=1 creates the library as a shared one (optional)
# -DGEN_PROCESSOR_EVENTS=1 generates and compiles a .cpp file that adds events specific to the underlying CPU (optional)
cmake . -B build -DBUILD_EXAMPLES=1
cmake --build build
# Optional: Build examples (in build/examples/bin) if -DBUILD_EXAMPLES=1
cmake --build build --target examples
Note
Further information and detailed building instructions (e.g., how to integrate into CMake projects) are available in the building guide.
- Building: Integrate perf-cpp seamlessly into your C++ projects.
- Counting Performance Events
  - Basics: Master recording hardware event statistics directly within your application.
  - Parallel and Multithreaded: Learn how to monitor events across threads and CPU cores.
  - Metrics: Learn how to combine hardware events into meaningful metrics for clearer performance insights.
  - Live Access: See how events can be accessed without stopping the recording, ideal for profiling tight loops and small functions.
- Recording Samples
  - Basics: Understand sampling mechanisms, which data to record, and how to access the results.
  - Parallel and Multithreaded: Learn how to record samples in multithreaded workloads.
  - Translating Instruction Pointers into Symbols and Samples into flame graphs: See how to translate instruction pointers into function names and prepare sampling results to transform them into flame graphs (e.g., using FlameGraph).
  - Analyzing Memory Access Patterns: See how to link memory sampling data to specific data objects to profile detailed memory access characteristics.
- Built-in and Hardware-specific Events: Discover built-in events and learn how to define new ones tailored to your hardware.
- Perf Paranoid: Learn how to configure perf permissions.
- Examples: Learn how to set up different features from code-examples.
- Changelog: Stay updated with the latest changes and improvements.
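As a rough idea of what preparing samples for flame graphs involves: the FlameGraph scripts consume "folded" stacks, one line per unique call stack with frames joined by semicolons (root first) followed by a sample count. The sketch below produces that text from already symbolized stacks. It is not perf-cpp's implementation; `fold_stacks` and `to_folded_text` are hypothetical helpers, and the input is assumed to be a list of call stacks with function names ordered from root to leaf.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

/// Groups identical call stacks and counts how often each was sampled.
/// Each stack lists function names from root (first) to leaf (last).
std::map<std::string, std::size_t>
fold_stacks(const std::vector<std::vector<std::string>>& stacks)
{
  std::map<std::string, std::size_t> folded;
  for (const auto& frames : stacks)
  {
    if (frames.empty())
    {
      continue; /// Nothing to attribute the sample to.
    }
    std::string key;
    for (const auto& frame : frames)
    {
      if (!key.empty())
      {
        key += ';';
      }
      key += frame;
    }
    ++folded[key];
  }
  return folded;
}

/// Renders the folded stacks as "frame;frame;frame count" lines.
std::string to_folded_text(const std::map<std::string, std::size_t>& folded)
{
  std::string text;
  for (const auto& [stack, count] : folded)
  {
    text += stack + ' ' + std::to_string(count) + '\n';
  }
  return text;
}
```

The resulting text can be written to a file and passed to FlameGraph's flamegraph.pl to render an SVG.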
- Clang / GCC with support for C++17 features.
- CMake version 3.10 or higher.
- Linux Kernel 4.0 or newer (note that some features need a newer Kernel).
- perf_event_paranoid setting: Adjust as needed to allow access to performance counters (see the Paranoid Value documentation).
- Python3, if you make use of processor-specific hardware event generation.
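To check the perf_event_paranoid setting from within an application before attempting to open counters, the value can be read from /proc/sys/kernel/perf_event_paranoid. The sketch below is not part of perf-cpp: both function names are hypothetical, and the descriptions paraphrase the levels documented for the mainline kernel (some distributions patch in stricter, higher values, which this sketch simply treats like level 2).

```cpp
#include <fstream>
#include <optional>
#include <string>

/// Reads the current paranoid level; yields no value if the file
/// is missing or does not contain an integer.
std::optional<int> read_perf_event_paranoid(
  const std::string& path = "/proc/sys/kernel/perf_event_paranoid")
{
  std::ifstream file{ path };
  int level = 0;
  if (!(file >> level))
  {
    return std::nullopt;
  }
  return level;
}

/// Summarizes what an unprivileged user may do at a given level.
std::string describe_perf_event_paranoid(const int level)
{
  if (level <= -1)
  {
    return "almost all events are allowed for all users";
  }
  if (level == 0)
  {
    return "raw and ftrace function tracepoints require privileges";
  }
  if (level == 1)
  {
    return "CPU-wide event access additionally requires privileges";
  }
  return "kernel profiling additionally requires privileges (user space only)";
}
```

An application could call `read_perf_event_paranoid()` at startup and print the description when opening a counter fails, which usually makes permission problems easier to diagnose.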
We welcome contributions and feedback. For feature requests, feedback, or bug reports, please reach out via our issue tracker or submit a pull request.
Alternatively, you can email me: [email protected].
Below is a non-exhaustive list of some other valuable profiling projects:
- PAPI offers access not only to CPU performance counters but also to a variety of other hardware components including GPUs, I/O systems, and more.
- Likwid is a collection of several command line tools for benchmarking, including an extensive wiki.
- PerfEvent provides lightweight access to performance counters, facilitating streamlined performance monitoring.
- Intel's Instrumentation and Tracing Technology allows applications to manage the collection of trace data effectively when used in conjunction with Intel VTune Profiler.
- For those who prefer a more hands-on approach, the perf_event_open system call can be utilized directly without any wrappers.
This is a non-exhaustive list of academic research papers and blog articles (feel free to add to it, e.g., via pull request – also your own work).
- Quantitative Evaluation of Intel PEBS Overhead for Online System-Noise Analysis (2017)
- Analyzing memory accesses with modern processors (2020)
- Precise Event Sampling on AMD Versus Intel: Quantitative and Qualitative Comparison (2023)
- Multi-level Memory-Centric Profiling on ARM Processors with ARM SPE (2024)