Lightweight recording and sampling of performance counters for specific code segments directly from your C++ application.

jmuehlig/perf-cpp

perf-cpp: Effortless Hardware Performance Monitoring for C++ Applications

License: LGPL-3.0 | Linux Kernel >= 4.0 | C++17

Quick Start | How to Build | Documentation | System Requirements

perf-cpp embeds Linux's hardware performance monitoring directly into your code, letting you profile exactly what matters and process the results in your application. Tools like Linux Perf, Intel® VTune™, and AMD uProf are powerful but monitor entire programs – and high-performance applications need surgical precision.

What can perf-cpp do?

Built around Linux's powerful perf subsystem, perf-cpp provides a clean interface for counting and sampling hardware events – without the complexity of low-level APIs.

  • Measure exactly what you want – utilize performance counters to count hardware events, similar to perf stat, but around specific code paths, not an entire binary (documentation).
  • Calculate metrics such as cycles per instruction and the cache-miss-to-access ratio based on hardware events and timing (documentation).
  • Access performance counters with low latency, without starting/stopping the counters, for micro-benchmarks or adaptive tuning (documentation).
  • Record instruction and memory samples, just like perf [mem] record – but from inside your application (documentation).
  • Correlate samples with data structures and symbols to generate per-class access statistics and flame graphs.
  • Mix built-in events (e.g., cycles, instructions, cache misses, ...) with processor-specific counters (documentation).

See various practical examples and the documentation for more details.

Quick Start

Record Hardware Event Statistics

Recording hardware event statistics operates much like perf stat: it quantifies critical events–such as executed instructions, CPU cycles, and cache misses–throughout a code segment's execution.

#include <iostream>
#include <perfcpp/event_counter.h>

/// Initialize the counter
const auto counter_definition = perf::CounterDefinition{};
auto event_counter = perf::EventCounter{ counter_definition };

/// Specify hardware events to count
event_counter.add({"seconds", "instructions", "cycles", "cache-misses"});

/// Run the workload
event_counter.start();
code_to_profile(); /// <-- Statistics are recorded during execution
event_counter.stop();

/// Print the result to the console
const auto result = event_counter.result();
for (const auto [event_name, value] : result)
{
    std::cout << event_name << ": " << value << std::endl;
}

Possible output:

seconds:      0.0955897 
instructions: 5.92087e+07
cycles:       4.70254e+08
cache-misses: 1.35633e+07
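
Raw counts like these are often easier to interpret as ratios. The following is a minimal sketch of the arithmetic only; it is independent of perf-cpp's own metric support, and the helper names `cycles_per_instruction` and `cache_misses_per_kilo_instruction` are hypothetical.

```cpp
#include <optional>

/// Hypothetical helper (not part of perf-cpp): cycles per instruction (CPI).
/// Returns std::nullopt if no instructions were counted.
std::optional<double> cycles_per_instruction(const double cycles, const double instructions)
{
    if (instructions == 0.0)
    {
        return std::nullopt;
    }
    return cycles / instructions;
}

/// Hypothetical helper (not part of perf-cpp): cache misses per 1,000 instructions.
/// Returns std::nullopt if no instructions were counted.
std::optional<double> cache_misses_per_kilo_instruction(const double cache_misses, const double instructions)
{
    if (instructions == 0.0)
    {
        return std::nullopt;
    }
    return cache_misses / instructions * 1000.0;
}
```

With the values printed above, `cycles_per_instruction(4.70254e+08, 5.92087e+07)` yields roughly 7.9 cycles per instruction, and `cache_misses_per_kilo_instruction(1.35633e+07, 5.92087e+07)` yields roughly 229 misses per 1,000 instructions.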

Note

For additional insights please refer to the guides on recording event statistics and event statistics on multiple CPUs/threads. Also, check out the hardware events documentation for details on both built-in and processor-specific events.

Record Samples

Recording samples functions much like perf [mem] record: it captures execution snapshots, e.g., the instruction pointer, executing CPU, and timestamp, at regular intervals (here every 50,000th CPU cycle).

#include <iostream>
#include <perfcpp/sampler.h>

/// Create the sampler
const auto counter_definition = perf::CounterDefinition{};
auto sampler = perf::Sampler{ counter_definition };

/// Specify when a sample is recorded: every 50,000th cycle
sampler.trigger("cycles", perf::Period{50000U});

/// Specify what data is included in a sample: timestamp, CPU ID, instruction pointer
sampler.values()
    .timestamp(true)
    .cpu_id(true)
    .instruction_pointer(true);

/// Run the workload
sampler.start();
code_to_profile(); /// <-- Samples are recorded during execution
sampler.stop();

/// Print the samples to the console
const auto samples = sampler.result();
for (const auto& record : samples)
{
    const auto timestamp = record.metadata().timestamp().value();
    const auto cpu_id = record.metadata().cpu_id().value();
    const auto instruction = record.instruction_execution().logical_instruction_pointer().value();
    
    std::cout 
        << "Time = " << timestamp << " | CPU = " << cpu_id
        << " | Instruction = 0x" << std::hex << instruction << std::dec
        << std::endl;
}

Possible output:

Time = 365449130714033 | CPU = 8 | Instruction = 0x5a6e84b2075c
Time = 365449130913157 | CPU = 8 | Instruction = 0x64af7417c75c
Time = 365449131112591 | CPU = 8 | Instruction = 0x5a6e84b2075c
Time = 365449131312005 | CPU = 8 | Instruction = 0x64af7417c75c 
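
Because the samples are available inside the application, they can be aggregated directly instead of printed. The following is a minimal sketch, assuming the three values shown above have been copied into a plain struct; `SampleRow` and `hottest_instructions` are hypothetical names and not part of perf-cpp.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

/// Hypothetical plain struct holding the three values printed above
/// (this is not a perf-cpp type).
struct SampleRow
{
    std::uint64_t timestamp;
    std::uint32_t cpu_id;
    std::uint64_t instruction_pointer;
};

/// Counts the samples per instruction pointer and returns (address, count)
/// pairs, sorted by descending count; ties are ordered by ascending address.
std::vector<std::pair<std::uint64_t, std::size_t>> hottest_instructions(const std::vector<SampleRow>& rows)
{
    auto counts = std::unordered_map<std::uint64_t, std::size_t>{};
    for (const auto& row : rows)
    {
        ++counts[row.instruction_pointer];
    }

    auto sorted = std::vector<std::pair<std::uint64_t, std::size_t>>{ counts.begin(), counts.end() };
    std::sort(sorted.begin(), sorted.end(), [](const auto& left, const auto& right) {
        return left.second != right.second ? left.second > right.second : left.first < right.first;
    });

    return sorted;
}
```

For the four samples printed above, this returns two entries with a count of two each, since both instruction pointers were sampled twice.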

Note

For additional details, such as the types of data that can be included in samples, please consult the sampling guide. For parallel sampling, see the guide on sampling on multiple CPUs/threads.

More Examples

We include a collection of examples demonstrating the functionality and interface of perf-cpp in the examples/ directory, including

  • examples for counting hardware events (examples/statistics)
  • and for sampling (examples/sampling).

Building

perf-cpp is designed as a library (static or shared) that can be linked to your application.

# Clone the repository
git clone https://github.com/jmuehlig/perf-cpp.git

# Switch to the repository folder
cd perf-cpp

# Optional: Switch to a specific version
git checkout v0.12.0

# Build the library (in build/)
# -DBUILD_EXAMPLES=1        compiles all examples (optional)
# -DBUILD_LIB_SHARED=1      creates the library as a shared one (optional)
# -DGEN_PROCESSOR_EVENTS=1  generates and compiles a .cpp file that adds events specific to the underlying CPU (optional)
cmake . -B build -DBUILD_EXAMPLES=1
cmake --build build

# Optional: Build examples (in build/examples/bin) if -DBUILD_EXAMPLES=1
cmake --build build --target examples

Note

Further information and detailed building instructions (e.g., how to integrate into CMake projects) are available in the building guide.

Full Documentation

Further Reading

  • Examples: Learn how to set up different features from code examples.
  • Changelog: Stay updated with the latest changes and improvements.

System Requirements

  • Clang / GCC with support for C++17 features.
  • CMake version 3.10 or higher.
  • Linux Kernel 4.0 or newer (note that some features need a newer Kernel).
  • perf_event_paranoid setting: Adjust as needed to allow access to performance counters (see the Paranoid Value documentation).
  • Python3, if you make use of processor-specific hardware event generation.

Contribute and Contact

We welcome contributions and feedback. For feature requests, feedback, or bug reports, please reach out via our issue tracker or submit a pull request.

Alternatively, you can email me: [email protected].


Further PMU-related Projects

Below is a non-exhaustive list of some other valuable profiling projects:

  • PAPI offers access not only to CPU performance counters but also to a variety of other hardware components including GPUs, I/O systems, and more.
  • Likwid is a collection of several command line tools for benchmarking, including an extensive wiki.
  • PerfEvent provides lightweight access to performance counters, facilitating streamlined performance monitoring.
  • Intel's Instrumentation and Tracing Technology allows applications to manage the collection of trace data effectively when used in conjunction with Intel VTune Profiler.
  • For those who prefer a more hands-on approach, the perf_event_open system call can be utilized directly without any wrappers.
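
To illustrate the last point, the following is a minimal sketch of counting user-space instructions with the raw system call, loosely following the example in the perf_event_open man page. The function names are hypothetical, error handling is reduced to returning std::nullopt, and the call may fail depending on the perf_event_paranoid setting or on virtualized hardware.

```cpp
#include <cstdint>
#include <cstring>
#include <optional>

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/// Hypothetical helper: describes a counter for retired instructions
/// that starts disabled and excludes kernel and hypervisor code.
perf_event_attr make_instruction_counter_attr()
{
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return attr;
}

/// Hypothetical helper: counts instructions executed by the calling thread
/// while running the workload. Returns std::nullopt if the counter cannot
/// be opened or read.
template<typename Workload>
std::optional<std::uint64_t> count_instructions(Workload&& workload)
{
    auto attr = make_instruction_counter_attr();

    /// pid = 0 and cpu = -1: measure the calling thread on any CPU.
    const auto fd = static_cast<int>(syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL));
    if (fd < 0)
    {
        return std::nullopt;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    workload();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    std::uint64_t count = 0U;
    const auto bytes = read(fd, &count, sizeof(count));
    close(fd);

    if (bytes != static_cast<ssize_t>(sizeof(count)))
    {
        return std::nullopt;
    }
    return count;
}
```

Even this single-counter case needs manual attribute setup, file descriptor handling, and ioctl calls; grouping several events or sampling adds considerably more code.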

Resources about (Perf-) Profiling

This is a non-exhaustive list of academic research papers and blog articles. Feel free to add to it, including your own work, e.g., via pull request.

Academic Papers

Blog Posts