
Support for per-token latency tracking in generate() (suggested options: using callback, profiler class, or using a config flag) #39437

@spsagar13

Description

Feature request

Summary

I would like to propose a feature to enable per-token latency tracking during generation in Hugging Face’s generate() loop (via _beam_search, sample, etc.). This is extremely useful for benchmarking models across hardware (e.g. CPU vs GPU, Arm vs x86), identifying bottlenecks, and understanding real-world inference performance at a granular level.

I’m suggesting three high-level designs for adding this feature.

Option 1: per_token_latency_callback (minimal, flexible)
This approach offers maximum flexibility with minimal intrusion: no config changes, just a simple hook for custom tracking. It adds a new argument to generate():

import time
from typing import Callable, Optional

def generate(..., per_token_latency_callback: Optional[Callable[[int, float], None]] = None):
    for token_idx in range(max_length):
        start = time.perf_counter()
        # generation step (forward pass, logits processing, sampling)
        end = time.perf_counter()
        if per_token_latency_callback is not None:
            per_token_latency_callback(token_idx, end - start)

Example Usage (for benchmarking)

# Simple function to print per-token latency.  
# Can be extended to log, store, or analyze latencies as needed
def log_token_latency(token_idx, latency):
    print(f"Token {token_idx}: {latency * 1000:.2f} ms")

output = model.generate(
    input_ids,
    max_new_tokens=5,
    per_token_latency_callback=log_token_latency  # Usage with new proposed argument
)

This would let users plug in their own logic to log, store, or analyze latencies without touching the internals of generate().
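
For example, here is a rough sketch of a callback that stores latencies to a CSV file for offline analysis; it assumes the proposed per_token_latency_callback argument from Option 1 and is purely illustrative:

import csv

# Illustrative only: assumes the proposed per_token_latency_callback argument exists
csv_file = open("token_latencies.csv", "w", newline="")
writer = csv.writer(csv_file)
writer.writerow(["token_idx", "latency_ms"])

def store_latency(token_idx, latency):
    # One row per generated token
    writer.writerow([token_idx, latency * 1000])

output = model.generate(
    input_ids,
    max_new_tokens=32,
    per_token_latency_callback=store_latency,
)
csv_file.close()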

Option 2: Profiler Class along with per_token_latency_callback (structured, user-friendly)
This approach introduces a clean, reusable structure that can be extended to track other metrics such as memory or FLOPs, which makes it ideal for advanced profiling. The caller passes in a profiler object, for example:

class TokenLatencyProfiler:
    def __init__(self):
        self.token_latencies = []

    def track(self, token_idx, latency):
        # Matches the (token_idx, latency) callback signature from Option 1
        self.token_latencies.append(latency)

profiler = TokenLatencyProfiler()
output = model.generate(input_ids, per_token_latency_callback=profiler.track)

print(profiler.token_latencies)

This structure makes it easy to support other metrics in the future (e.g. memory, FLOPs, peak RAM/GPU usage).
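
As an illustration of that extensibility, here is a sketch of a profiler that also records peak GPU memory per token. It assumes PyTorch with a CUDA device; the class name is hypothetical and not part of the proposal itself:

import torch

class TokenLatencyMemoryProfiler:
    def __init__(self):
        self.token_latencies = []
        self.peak_memory_bytes = []

    def track(self, token_idx, latency):
        self.token_latencies.append(latency)
        if torch.cuda.is_available():
            # Peak memory allocated since the previous reset, in bytes
            self.peak_memory_bytes.append(torch.cuda.max_memory_allocated())
            torch.cuda.reset_peak_memory_stats()
        else:
            self.peak_memory_bytes.append(None)

profiler = TokenLatencyMemoryProfiler()
output = model.generate(input_ids, per_token_latency_callback=profiler.track)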

Option 3: track_token_latency=True in GenerationConfig (inspired by Intel's IPEX)
This approach adds convenience for quick profiling but introduces some coupling between the generation config and the return object structure.

As a higher-level alternative - conceptually inspired by Intel’s IPEX benchmarking approach - this method introduces a dedicated configuration flag within GenerationConfig to enable built-in per-token latency tracking.
Unlike IPEX, which sets this behavior via setattr() on the model config at runtime, this proposal integrates the flag directly into GenerationConfig.

  • Update GenerationConfig
track_token_latency: bool = field(
    default=False,
    metadata={"help": "If True, enables per-token latency tracking during generation."}
)
  • Use the flag internally
def generate(self, input_ids, generation_config=None, max_length=20, ...):
    generation_config = generation_config or self.generation_config
    token_latencies = [] if generation_config.track_token_latency else None

    for token_idx in range(max_length):
        start = time.perf_counter()
        # generation step (e.g., logits -> sampling -> input_ids update)
        end = time.perf_counter()

        if token_latencies is not None:
            token_latencies.append(end - start)

    output = GenerationOutput(...)  # or SampleDecoderOnlyOutput, etc.
    if token_latencies is not None:
        output.token_latencies = token_latencies

    return output
  • Example Usage
from transformers import GenerationConfig

gen_config = GenerationConfig(track_token_latency=True)
output = model.generate(input_ids, generation_config=gen_config)

print(output.token_latencies)  # → [0.012, 0.010, 0.011, ...]

Motivation

As a performance engineer working with LLMs on hardware like Arm and x86 CPUs, I often need to measure token-by-token latency during generation (e.g., to analyze startup cost, cache reuse, SMT impact, or I/O stalls).

However:

  • There is currently no easy way to track per-token latency via Hugging Face’s generate() API.
  • The existing output.scores provides logits but no timing or performance hooks (see the snippet below).
  • I discovered that Intel’s IPEX benchmark internally tracks token latencies, which was helpful for comparison - but it’s not easily accessible or extendable outside Intel’s stack.
  • So I had to patch generate logic locally to do something like this:
# inside _beam_search under transformers/src/transformers/generation/utils.py
import time

latencies = []
for token_idx in range(max_length):
    start = time.perf_counter()
    # generation step (forward pass + beam update)
    end = time.perf_counter()
    latencies.append(end - start)

But this local patch could break with updates to the Transformers library, or with models that override generate() using custom loops.
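
For context, the closest thing available in the public API today is returning per-step scores, which carries no timing information; the snippet below uses the existing return_dict_in_generate and output_scores flags:

output = model.generate(
    input_ids,
    max_new_tokens=5,
    return_dict_in_generate=True,
    output_scores=True,
)

# output.scores is a tuple with one (batch_size, vocab_size) logits tensor per
# generated token -- useful for inspecting predictions, but it says nothing
# about how long each decoding step took.
print(len(output.scores))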

Having official support would benefit researchers, profiling tools, downstream libraries, and anyone benchmarking LLMs on novel hardware.

This also helps with:

  • Debugging regressions in low-level compute kernels (e.g., matmul, softmax) that impact token generation performance
  • Optimizing latency-critical use cases (e.g., serverless inference, streaming chat) by isolating startup cost, monitoring per-token response time, and diagnosing real-time slowdowns
  • Analyzing time-to-first-token vs. steady-state generation performance (see the sketch below)
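
As a sketch of that last point, the snippet below splits a list of per-token latencies (as collected by any of the options above) into time-to-first-token and steady-state statistics. The numbers are made up, and whether the first entry includes prefill would depend on the implementation:

import statistics

# Hypothetical per-token latencies in seconds, e.g. as collected by Option 2's profiler
token_latencies = [0.120, 0.011, 0.010, 0.012, 0.011, 0.010]

ttft = token_latencies[0]           # time-to-first-token
steady_state = token_latencies[1:]  # subsequent decoding steps

print(f"TTFT: {ttft * 1000:.1f} ms")
print(f"Steady-state mean: {statistics.mean(steady_state) * 1000:.1f} ms")
print(f"Steady-state p50:  {statistics.median(steady_state) * 1000:.1f} ms")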

Your contribution

Yes, I’m happy to submit a PR.

I can start by adding Option 1 (per_token_latency_callback) as a lightweight, non-breaking change to GenerationMixin. Optionally, I can follow up with Option 2 (profiler class) and/or Option 3 (track_token_latency=True in GenerationConfig), depending on the maintainers’ preferences, and incorporate any additional recommendations or improvements.

Please let me know whether these changes are acceptable and, if so, which direction is preferred, and I’ll start implementing.
