[Tracking] InferenceEngine Remake

We have recently landed many incredible features in mlc-serve (129 PRs in the last three months). The complexity that came with those features has made the code harder to understand and change, especially around InferenceEngine, a central piece involved in most features. Now is a good time to iterate on the implementation of InferenceEngine based on what we have learned, building a solid foundation so that we can keep up the momentum of delivering new features.

The goals of this tracking issue are to:

- G1: Improve the readability of InferenceEngine and reduce the friction of introducing new features.
- G2: Enable early detection of performance or correctness regressions in InferenceEngine.

Items:
- [ ] Revisit the interfaces of Engine and ModelModule, and make sure they fit other near-term goals (dynamic split fuse, speculative decoding, common kv cache interface)
- [ ] A test framework that mocks ModelModule using profiling data from real execution.
    - Enables correctness testing and performance benchmarking without a real model or GPU.
- [ ] Test cases that reflect the bugs we have seen in the past
- [ ] Remove all unused code (SynchronousInferenceEngine)
- [ ] Remove indirection (engine_common)
- [ ] Clean up RequestState
- [ ] Extract the request scheduling logic into a standalone component
- [ ] Tokenization in a separate process (for lower time-to-first-token)
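
To make the mock-based test framework item more concrete, here is a minimal sketch of what a profiling-driven mock could look like. Everything in it is hypothetical and for illustration only: `MockModelModule`, `FakeRequest`, the `generate` method, and the latency table format are assumptions, not the current mlc-serve interfaces. The idea is to replay profiled step latencies on a virtual clock and return deterministic tokens, so engine behavior can be tested without a model or GPU.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a request handed to the model in one step."""
    request_id: str
    num_tokens: int  # tokens processed this step (prompt length, or 1 for decode)


class MockModelModule:
    """Hypothetical mock that replays profiled latencies instead of running a model."""

    def __init__(self, latency_ms_by_tokens: Dict[int, float]):
        # Assumed profile format: batch token count -> measured step latency (ms).
        self.profile = sorted(latency_ms_by_tokens.items())
        self.virtual_time_ms = 0.0  # simulated clock, advanced on every step
        self.steps = 0

    def _lookup_latency(self, total_tokens: int) -> float:
        # Use the smallest profiled bucket that covers the batch,
        # falling back to the largest bucket for oversized batches.
        for tokens, latency in self.profile:
            if total_tokens <= tokens:
                return latency
        return self.profile[-1][1]

    def generate(self, batch: List[FakeRequest]) -> Dict[str, int]:
        total = sum(r.num_tokens for r in batch)
        self.virtual_time_ms += self._lookup_latency(total)
        self.steps += 1
        # Deterministic fake token id per request, so tests are reproducible.
        return {r.request_id: (len(r.request_id) + self.steps) % 32000 for r in batch}


# Example: one prefill-heavy step followed by one decode step.
mock = MockModelModule({16: 5.0, 128: 12.0, 1024: 40.0})
first = mock.generate([FakeRequest("a", 100), FakeRequest("bb", 1)])
second = mock.generate([FakeRequest("a", 1)])
```

Because time is simulated, a test can assert on scheduling decisions and on estimated latency in the same run, which would cover both G1-style refactoring safety and G2-style regression detection.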

[Tracking] InferenceEngine Remake #193
