In the streaming case, we cannot clamp the generated tokens and recompute them.
Moreover, since the clamping logic is done in the worker but not in the main process, a discrepancy arises between the main process and the worker process. See #158 and #164.
We need to either
- Require that generation never grows beyond `max_num_batched_tokens`, or
- Split the recovery of such requests into multiple batches, using the new `evaluate_multi_query` function from #156 (Add new Relax function to the batched model for evaluating query tokens over multiple time steps in parallel); see the sketch after this list.
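A minimal sketch of the second option, just to make the chunked recovery concrete. The names here are assumptions for illustration: `evaluate_multi_query` stands in for the Relax function proposed in #156 (its actual signature in the engine may differ), and the token budget is passed in explicitly as `max_num_batched_tokens`.

```python
# Illustrative sketch only, not the actual engine API.
# Recover a request whose token history exceeds max_num_batched_tokens by
# replaying it in bounded chunks rather than clamping the generated tokens
# and recomputing them.

from typing import Callable, List, Sequence


def recover_in_chunks(
    token_ids: Sequence[int],
    max_num_batched_tokens: int,
    evaluate_multi_query: Callable[[List[int], int], None],
) -> None:
    """Replay `token_ids` (prompt plus already generated tokens) through the model,
    never submitting more than `max_num_batched_tokens` tokens in one batch.

    `evaluate_multi_query(chunk, start_pos)` is assumed to run the batched model
    over `chunk` starting at position `start_pos`, filling the KV cache for those
    positions in parallel (the role of the function added in #156).
    """
    for start in range(0, len(token_ids), max_num_batched_tokens):
        chunk = list(token_ids[start : start + max_num_batched_tokens])
        evaluate_multi_query(chunk, start)
```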