Feature request
Summary
I would like to propose a feature that enables per-token latency tracking during generation in Hugging Face's generate() loop (via _beam_search, _sample, etc.). This would be extremely useful for benchmarking models across hardware (e.g. CPU vs. GPU, Arm vs. x86), identifying bottlenecks, and understanding real-world inference performance at a granular level.
I'm suggesting three high-level designs for adding this feature.
Option 1: per_token_latency_callback (minimal, flexible)
This approach offers maximum flexibility with minimal intrusion: no config changes, just a simple hook for custom tracking.
A new argument would be added to generate():
def generate(..., per_token_latency_callback: Optional[Callable[[int, float], None]] = None):
    for token_idx in range(max_length):
        start = time.perf_counter()
        # generation step
        end = time.perf_counter()
        if per_token_latency_callback is not None:
            per_token_latency_callback(token_idx, end - start)
Example usage (for benchmarking):

# Simple function to print per-token latency.
# Can be extended to log, store, or analyze latencies as needed.
def log_token_latency(token_idx, latency):
    print(f"Token {token_idx}: {latency * 1000:.2f} ms")

output = model.generate(
    input_ids,
    max_new_tokens=5,
    per_token_latency_callback=log_token_latency,  # usage of the proposed argument
)
This would allow users to easily plug in their own logic to log, store, or analyze latency without touching the internal logic.
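For benchmarking, the callback can just as easily aggregate latencies for later analysis. A rough sketch, assuming the proposed per_token_latency_callback argument lands as described (the helper names here are illustrative):

import statistics

latencies_ms = []

def collect_latency(token_idx, latency):
    # Store each step's latency in milliseconds for later summary statistics.
    latencies_ms.append(latency * 1000)

output = model.generate(
    input_ids,
    max_new_tokens=64,
    per_token_latency_callback=collect_latency,
)

print(f"tokens: {len(latencies_ms)}")
print(f"mean:   {statistics.mean(latencies_ms):.2f} ms")
print(f"p99:    {statistics.quantiles(latencies_ms, n=100)[98]:.2f} ms")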
Option 2: a profiler class along with per_token_latency_callback (structured, user-friendly)
This approach introduces a clean, reusable structure that can be extended to track other metrics such as memory or FLOPs, making it well suited to advanced profiling.
A reusable profiler object is passed in, for example:
class TokenLatencyProfiler:
    def __init__(self):
        self.token_latencies = []

    def track(self, token_idx, latency):
        self.token_latencies.append(latency)

profiler = TokenLatencyProfiler()
output = model.generate(input_ids, per_token_latency_callback=profiler.track)
print(profiler.token_latencies)
This structure makes it easy to support other metrics in the future (e.g. memory, FLOPs, peak RAM/GPU usage).
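As a rough illustration of that extensibility (a sketch only; the class name and memory-tracking details are hypothetical and assume the Option 2 callback signature plus a CUDA device):

import torch

class TokenProfiler:
    """Collects per-token latency and, when CUDA is available, per-step peak GPU memory."""
    def __init__(self):
        self.token_latencies = []
        self.peak_memory_bytes = []

    def track(self, token_idx, latency):
        self.token_latencies.append(latency)
        if torch.cuda.is_available():
            # Peak allocator usage since the previous reset, i.e. roughly this step's peak.
            self.peak_memory_bytes.append(torch.cuda.max_memory_allocated())
            torch.cuda.reset_peak_memory_stats()

profiler = TokenProfiler()
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()  # start the first step from a clean slate
output = model.generate(input_ids, max_new_tokens=32, per_token_latency_callback=profiler.track)

Whether such extra metrics belong in a user-side profiler or in the generation loop itself is an open design question.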
Option 3: track_token_latency=True in GenerationConfig (inspired by Intel's IPEX)
This approach adds convenience for quick profiling but introduces some coupling between the generation config and the return object structure.
As a higher-level alternative, conceptually inspired by Intel's IPEX benchmarking approach, this option introduces a dedicated configuration flag in GenerationConfig to enable built-in per-token latency tracking. Unlike IPEX, which sets this behavior via setattr() on the model config at runtime, this proposal integrates the flag directly into GenerationConfig.
- Update GenerationConfig:

track_token_latency: bool = field(
    default=False,
    metadata={"help": "If True, enables per-token latency tracking during generation."},
)
- Use the flag internally:

def generate(self, input_ids, generation_config=None, max_length=20, ...):
    generation_config = generation_config or self.generation_config
    token_latencies = [] if generation_config.track_token_latency else None

    for token_idx in range(max_length):
        start = time.perf_counter()
        # generation step (e.g., logits -> sampling -> input_ids update)
        end = time.perf_counter()
        if token_latencies is not None:
            token_latencies.append(end - start)

    output = GenerationOutput(...)  # or SampleDecoderOnlyOutput, etc.
    if token_latencies is not None:
        output.token_latencies = token_latencies
    return output
- Example usage:

from transformers import GenerationConfig

gen_config = GenerationConfig(track_token_latency=True)
output = model.generate(input_ids, generation_config=gen_config)
print(output.token_latencies)  # → [0.012, 0.010, 0.011, ...]
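Returning latencies on the output object would also make downstream analysis trivial. A sketch, assuming Option 3 lands with a token_latencies attribute as shown above (note that the first step typically includes the prefill forward pass):

latencies = output.token_latencies
ttft = latencies[0]            # time-to-first-token (prefill-dominated)
steady_state = latencies[1:]   # subsequent decode steps

print(f"TTFT:              {ttft * 1000:.2f} ms")
if steady_state:
    mean_ms = sum(steady_state) / len(steady_state) * 1000
    print(f"steady-state mean: {mean_ms:.2f} ms/token")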
Motivation
As a performance engineer working with LLMs on hardware like Arm and x86 CPUs, I often need to measure token-by-token latency during generation (e.g., to analyze startup cost, cache reuse, SMT impact, or I/O stalls).
However:
- There is currently no easy way to track per-token latency via Hugging Face's generate() API.
- The existing output.scores provides logits, but no timing or performance hooks.
- I discovered that Intel's IPEX benchmark internally tracks token latencies, which was helpful for comparison, but it's not easily accessible or extendable outside Intel's stack.
- So I had to patch the generate logic locally to do something like this:

# inside _beam_search under transformers/src/transformers/generation/utils.py
for token_idx in range(max_length):
    start = time.perf_counter()
    # generation
    end = time.perf_counter()
    latencies.append(end - start)
But this patch could break with updates to the Transformers library, or with models that override generate() with custom loops.
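A somewhat less fragile stop-gap is to hang a timer off the public logits_processor argument, which generate() invokes once per decoding step. This is only a rough sketch and only an approximation (it measures the interval between consecutive logits-processor calls, and accurate GPU timing would additionally need torch.cuda.synchronize()), which is exactly why first-class support would be preferable:

import time
from transformers import LogitsProcessor, LogitsProcessorList

class StepTimer(LogitsProcessor):
    """Records a timestamp at every decoding step; deltas approximate per-token latency."""
    def __init__(self):
        self.timestamps = []

    def __call__(self, input_ids, scores):
        self.timestamps.append(time.perf_counter())
        return scores  # pass the logits through unchanged

timer = StepTimer()
output = model.generate(input_ids, max_new_tokens=16,
                        logits_processor=LogitsProcessorList([timer]))
step_latencies = [b - a for a, b in zip(timer.timestamps, timer.timestamps[1:])]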
Having official support would benefit researchers, profiling tools, downstream libraries, and anyone benchmarking LLMs on novel hardware.
This also helps with:
- Debugging regressions in low-level compute kernels (e.g., matmul, softmax) that impact token generation performance
- Optimizing latency-critical use cases (e.g., serverless inference, streaming chat) by isolating startup cost, monitoring per-token response time, and diagnosing real-time slowdowns
- Analyzing time-to-first-token vs. steady-state generation performance
Your contribution
Yes, I’m happy to submit a PR.
I can start by adding Option 1 (per_token_latency_callback) as a lightweight, non-breaking change to GenerationMixin. Optionally, I can follow up with Option 2 (the profiler class) and/or Option 3 (track_token_latency=True in GenerationConfig), depending on the maintainers' preferences, and incorporate any additional recommendations or improvements.
Please let me know whether these changes are acceptable and, if so, which direction is preferred, and I'll start implementing.