Feature request
Summary
I would like to propose a feature that enables per-token latency tracking during generation in Hugging Face's generate() loop (via _beam_search, _sample, etc.). This would be extremely useful for benchmarking models across hardware (e.g. CPU vs. GPU, Arm vs. x86), identifying bottlenecks, and understanding real-world inference performance at a granular level.
I'm suggesting three high-level designs for adding this feature.
Option 1: per_token_latency_callback (minimal, flexible)
This approach offers maximum flexibility with minimal intrusion: no config changes, just a simple hook for custom tracking.
A new argument would be added to generate():
def generate(..., per_token_latency_callback: Optional[Callable[[int, float], None]] = None):
    for token_idx in range(max_length):
        start = time.perf_counter()
        # generation step
        end = time.perf_counter()
        if per_token_latency_callback is not None:
            per_token_latency_callback(token_idx, end - start)
Example usage (for benchmarking):

# Simple function to print per-token latency.
# Can be extended to log, store, or analyze latencies as needed.
def log_token_latency(token_idx, latency):
    print(f"Token {token_idx}: {latency * 1000:.2f} ms")

output = model.generate(
    input_ids,
    max_new_tokens=5,
    per_token_latency_callback=log_token_latency,  # usage of the proposed argument
)
This would allow users to easily plug in their own logic to log, store, or analyze latency without touching the internal logic.
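For benchmarking, the callback can just as easily aggregate latencies for later analysis. A rough sketch, assuming the proposed per_token_latency_callback argument lands as described (the helper names here are illustrative):

import statistics

latencies_ms = []

def collect_latency(token_idx, latency):
    # Store each step's latency in milliseconds for later summary statistics.
    latencies_ms.append(latency * 1000)

output = model.generate(
    input_ids,
    max_new_tokens=64,
    per_token_latency_callback=collect_latency,
)

print(f"tokens: {len(latencies_ms)}")
print(f"mean:   {statistics.mean(latencies_ms):.2f} ms")
print(f"p99:    {statistics.quantiles(latencies_ms, n=100)[98]:.2f} ms")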
Option 2: a profiler class along with per_token_latency_callback (structured, user-friendly)
This approach introduces a clean, reusable structure that can be extended to track other metrics such as memory or FLOPs, making it well suited to advanced profiling.
A reusable profiler object is passed in, for example:
class TokenLatencyProfiler:
    def __init__(self):
        self.token_latencies = []

    def track(self, token_idx, latency):
        self.token_latencies.append(latency)

profiler = TokenLatencyProfiler()
output = model.generate(input_ids, per_token_latency_callback=profiler.track)
print(profiler.token_latencies)
This structure makes it easy to support other metrics in the future (e.g. memory, FLOPs, peak RAM/GPU usage).
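As a rough illustration of that extensibility (a sketch only; the class name and memory-tracking details are hypothetical and assume the Option 2 callback signature plus a CUDA device):

import torch

class TokenProfiler:
    """Collects per-token latency and, when CUDA is available, per-step peak GPU memory."""
    def __init__(self):
        self.token_latencies = []
        self.peak_memory_bytes = []

    def track(self, token_idx, latency):
        self.token_latencies.append(latency)
        if torch.cuda.is_available():
            # Peak allocator usage since the previous reset, i.e. roughly this step's peak.
            self.peak_memory_bytes.append(torch.cuda.max_memory_allocated())
            torch.cuda.reset_peak_memory_stats()

profiler = TokenProfiler()
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()  # start the first step from a clean slate
output = model.generate(input_ids, max_new_tokens=32, per_token_latency_callback=profiler.track)

Whether such extra metrics belong in a user-side profiler or in the generation loop itself is an open design question.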
Option 3: track_token_latency=True in GenerationConfig (inspired by Intel's IPEX)
This approach adds convenience for quick profiling but introduces some coupling between the generation config and the return object structure.
As a higher-level alternative, conceptually inspired by Intel's IPEX benchmarking approach, this option introduces a dedicated configuration flag in GenerationConfig to enable built-in per-token latency tracking. Unlike IPEX, which sets this behavior via setattr() on the model config at runtime, this proposal integrates the flag directly into GenerationConfig.
- Update GenerationConfig:

track_token_latency: bool = field(
    default=False,
    metadata={"help": "If True, enables per-token latency tracking during generation."},
)
- Use the flag internally:

def generate(self, input_ids, generation_config=None, max_length=20, ...):
    generation_config = generation_config or self.generation_config
    token_latencies = [] if generation_config.track_token_latency else None

    for token_idx in range(max_length):
        start = time.perf_counter()
        # generation step (e.g., logits -> sampling -> input_ids update)
        end = time.perf_counter()
        if token_latencies is not None:
            token_latencies.append(end - start)

    output = GenerationOutput(...)  # or SampleDecoderOnlyOutput, etc.
    if token_latencies is not None:
        output.token_latencies = token_latencies
    return output
- Example usage:

from transformers import GenerationConfig

gen_config = GenerationConfig(track_token_latency=True)
output = model.generate(input_ids, generation_config=gen_config)
print(output.token_latencies)  # → [0.012, 0.010, 0.011, ...]
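Returning latencies on the output object would also make downstream analysis trivial. A sketch, assuming Option 3 lands with a token_latencies attribute as shown above (note that the first step typically includes the prefill forward pass):

latencies = output.token_latencies
ttft = latencies[0]            # time-to-first-token (prefill-dominated)
steady_state = latencies[1:]   # subsequent decode steps

print(f"TTFT:              {ttft * 1000:.2f} ms")
if steady_state:
    mean_ms = sum(steady_state) / len(steady_state) * 1000
    print(f"steady-state mean: {mean_ms:.2f} ms/token")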
Motivation
As a performance engineer working with LLMs on hardware like Arm and x86 CPUs, I often need to measure token-by-token latency during generation (e.g., to analyze startup cost, cache reuse, SMT impact, or I/O stalls).
However:
- There is currently no easy way to track per-token latency via Hugging Face's generate() API.
- The existing output.scores provides logits, but no timing or performance hooks.
- I discovered that Intel's IPEX benchmark internally tracks token latencies, which was helpful for comparison, but it's not easily accessible or extendable outside Intel's stack.
- So I had to patch the generate logic locally to do something like this:

# inside _beam_search under transformers/src/transformers/generation/utils.py
for token_idx in range(max_length):
    start = time.perf_counter()
    # generation
    end = time.perf_counter()
    latencies.append(end - start)
But this patch could break with updates to the Transformers library, or with models that override generate() with custom loops.
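A somewhat less fragile stop-gap is to hang a timer off the public logits_processor argument, which generate() invokes once per decoding step. This is only a rough sketch and only an approximation (it measures the interval between consecutive logits-processor calls, and accurate GPU timing would additionally need torch.cuda.synchronize()), which is exactly why first-class support would be preferable:

import time
from transformers import LogitsProcessor, LogitsProcessorList

class StepTimer(LogitsProcessor):
    """Records a timestamp at every decoding step; deltas approximate per-token latency."""
    def __init__(self):
        self.timestamps = []

    def __call__(self, input_ids, scores):
        self.timestamps.append(time.perf_counter())
        return scores  # pass the logits through unchanged

timer = StepTimer()
output = model.generate(input_ids, max_new_tokens=16,
                        logits_processor=LogitsProcessorList([timer]))
step_latencies = [b - a for a, b in zip(timer.timestamps, timer.timestamps[1:])]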
Having official support would benefit researchers, profiling tools, downstream libraries, and anyone benchmarking LLMs on novel hardware.
This also helps with:
- Debugging regressions in low-level compute kernels (e.g., matmul, softmax) that impact token generation performance
- Optimizing latency-critical use cases (e.g., serverless inference, streaming chat) by isolating startup cost, monitoring per-token response time, and diagnosing real-time slowdowns
- Analyzing time-to-first-token vs. steady-state generation performance
Your contribution
Yes, I’m happy to submit a PR.
I can start by adding Option 1 (per_token_latency_callback) as a lightweight, non-breaking change to GenerationMixin. Optionally, I can follow up with Option 2 (the profiler class) and/or Option 3 (track_token_latency=True in GenerationConfig), depending on the maintainers' preferences, and incorporate any additional recommendations or improvements.
Please let me know whether these changes are acceptable and, if so, which direction is preferred, and I'll start implementing.