
Remove repeated concat of prompt and decode tokens in detokenization #139

Open
@masahi

Description


Since the parallel sampling work, we maintain prompt and decode tokens separately. However, to preserve the detokenization output, I had to concatenate them repeatedly:
https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/engine/engine_common.py#L97-L99

It seems very wasteful to concatenate and detokenize the entire token sequence at every step, when all we really need is the newly added delta at the end (the postfix).
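One possible direction is offset-based incremental detokenization, similar in spirit to the `prefix_offset`/`read_offset` scheme used by some serving stacks such as vLLM. The sketch below is hypothetical and is not mlc_serve's API: `IncrementalDetokenizer`, `from_prompt`, `step`, `toy_decode` and the `context=5` prompt tail are all assumed names and values, and `toy_decode` is a byte-level stand-in for a real tokenizer's `decode`. Each step decodes only a small window instead of prompt plus all decode tokens. It keeps a few trailing prompt tokens as context because some tokenizers decode a token differently depending on what precedes it, and it holds back output while the text ends in U+FFFD (an incomplete multi-byte character).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

# Assumption: a decode function mapping token ids to text (stand-in for tokenizer.decode).
Decode = Callable[[Sequence[int]], str]


@dataclass
class IncrementalDetokenizer:
    """Hypothetical sketch: return only the new text delta for each generated token."""

    decode: Decode
    tokens: List[int] = field(default_factory=list)  # sliding window, never the full sequence
    prefix_offset: int = 0  # start of the already-emitted context inside the window
    read_offset: int = 0    # end of the already-emitted text inside the window

    @classmethod
    def from_prompt(cls, decode: Decode, prompt_ids: Sequence[int], context: int = 5):
        # Keep only a short tail of the prompt as decoding context (no full concat).
        tail = list(prompt_ids[-context:])
        return cls(decode=decode, tokens=tail, prefix_offset=0, read_offset=len(tail))

    def step(self, new_token: int) -> str:
        self.tokens.append(new_token)
        prefix_text = self.decode(self.tokens[self.prefix_offset:self.read_offset])
        full_text = self.decode(self.tokens[self.prefix_offset:])
        # Hold back output while the tail is an incomplete character (decodes to U+FFFD).
        if len(full_text) > len(prefix_text) and not full_text.endswith("\ufffd"):
            delta = full_text[len(prefix_text):]
            # Slide the window: the old read position becomes the new prefix start.
            self.tokens = self.tokens[self.read_offset:]
            self.prefix_offset = 0
            self.read_offset = len(self.tokens)
            return delta
        return ""


def toy_decode(ids: Sequence[int]) -> str:
    # Hypothetical byte-level "tokenizer": each token id is a single UTF-8 byte.
    return bytes(ids).decode("utf-8", errors="replace")


prompt_ids = list("Hi ".encode())
detok = IncrementalDetokenizer.from_prompt(toy_decode, prompt_ids)
out = [detok.step(t) for t in "café!".encode()]
print(out)  # ['c', 'a', 'f', '', 'é', '!']
```

Here "é" spans two byte tokens, so the first of them yields an empty delta and the character is emitted once it is complete. The window stays bounded by the last emitted segment plus pending tokens, so per-step decoding cost no longer grows with the prompt and generation length.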

Labels: bug

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions