We have recently landed many features in mlc-serve (129 PRs in the last three months). With the complexity those features bring, the code has become harder to understand and change, especially around the InferenceEngine, a central piece involved in most features. Now is a good time to iterate on the implementation of InferenceEngine based on what we have learned, building a solid foundation so that we can keep up the momentum of delivering new features.
The goals of this tracking issue are to:
- G1: Improve the readability of InferenceEngine and reduce the friction of introducing new features.
- G2: Enable early detection of performance or correctness regressions in InferenceEngine.
Items:
- Revisit the interfaces of Engine and ModelModule, and make sure they work well for other near-term goals (dynamic split fuse, speculative decoding, common kv cache interface)
- Test framework that mocks the ModelModule based on profiling data from real execution.
- Enable correctness testing and performance benchmarking without a real model or GPU.
- Test cases that reflect the bugs we have seen in the past
- Remove all unused code (SynchronousInferenceEngine)
- Remove indirection (engine_common)
- Clean up RequestState
- Extract the request scheduling logic into a standalone component
- Tokenization in a separate process (for lower time-to-first-token)
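
To make the mock test framework item more concrete, here is a minimal sketch of what a GPU-free test could look like. It is not based on the current mlc-serve interfaces: `ProfiledLatency`, `MockRequest`, `MockModelModule`, and `run_until_done` are all hypothetical names, the latency numbers would have to come from real profiling runs, and the scheduling loop is a toy first-come-first-served stand-in for the standalone scheduler mentioned above. The idea is that the mock emits deterministic tokens and advances a simulated clock, so both correctness and latency metrics (such as time-to-first-token) can be asserted without a real model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProfiledLatency:
    """Hypothetical profiling record; values would be measured on real hardware."""
    prefill_per_token_s: float  # cost of prefilling one prompt token
    decode_per_step_s: float    # fixed cost of one batched decode step


@dataclass
class MockRequest:
    """Hypothetical, stripped-down request state used only by this sketch."""
    request_id: str
    prompt_len: int
    max_new_tokens: int
    generated: List[int] = field(default_factory=list)
    prefilled: bool = False


class MockModelModule:
    """Stand-in for a model: deterministic tokens plus a simulated clock."""

    def __init__(self, latency: ProfiledLatency) -> None:
        self.latency = latency
        self.simulated_time_s = 0.0

    def generate(self, batch: List[MockRequest]) -> Dict[str, int]:
        """Produce one token per request and charge the profiled cost."""
        if not batch:
            return {}
        step_cost = self.latency.decode_per_step_s
        tokens: Dict[str, int] = {}
        for req in batch:
            if not req.prefilled:
                # First time we see this request: pay for its prompt.
                step_cost += req.prompt_len * self.latency.prefill_per_token_s
                req.prefilled = True
            # Deterministic fake token: the index of the token being generated.
            tokens[req.request_id] = len(req.generated)
        self.simulated_time_s += step_cost
        return tokens


def run_until_done(
    model: MockModelModule, requests: List[MockRequest], max_batch_size: int
) -> Dict[str, float]:
    """Toy FCFS loop; returns simulated time-to-first-token per request."""
    pending = list(requests)
    ttft: Dict[str, float] = {}
    while pending:
        batch = pending[:max_batch_size]
        tokens = model.generate(batch)
        for req in batch:
            req.generated.append(tokens[req.request_id])
            # Record the clock only when the first token arrives.
            ttft.setdefault(req.request_id, model.simulated_time_s)
        pending = [r for r in pending if len(r.generated) < r.max_new_tokens]
    return ttft
```

A regression test could then run the same workload under two scheduler configurations and assert on both the generated tokens and the simulated latencies, which covers G2 without needing a GPU in CI.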