Open
Description
#207 is only the first cut. Many TODO items are left:
- Fix memory profiling (see the comment on #207, "Enable running PyTorch models")
- Make single-gpu performance at parity with the MLC model
- Make multi-gpu performance sane
- Consider using CUDA graphs if we decide to keep the 2D padded input representation
- Or, consider reverting the 2D input change
- Revisit the custom changes in our vllm fork (https://github.com/octoml/vllm/tree/for-mlc-serve) and minimize them
- Figure out how to support other models besides the ones in vllm
- Support parallel-sampling eviction by recompute (requires model change)
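
For context on the 2D padded input item, here is a rough pure-Python sketch of the trade-off as I understand it. It is not code from #207 or from vllm. All names (`pad_to_2d`, `flatten_to_1d`, `padding_overhead`, `PAD_TOKEN_ID`) are hypothetical, and it assumes "2D padded" means a `[batch, max_len]` layout as opposed to a flattened 1D token list with per-sequence offsets.

```python
from typing import List, Tuple

PAD_TOKEN_ID = 0  # hypothetical padding id, chosen only for this illustration


def pad_to_2d(seqs: List[List[int]], pad_id: int = PAD_TOKEN_ID) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one."""
    max_len = max((len(s) for s in seqs), default=0)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]


def flatten_to_1d(seqs: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Concatenate all sequences and record the start offset of each one."""
    flat: List[int] = []
    offsets: List[int] = []
    for s in seqs:
        offsets.append(len(flat))
        flat.extend(s)
    return flat, offsets


def padding_overhead(seqs: List[List[int]]) -> float:
    """Fraction of the 2D batch that is padding rather than real tokens."""
    total = len(seqs) * max((len(s) for s in seqs), default=0)
    real = sum(len(s) for s in seqs)
    return 0.0 if total == 0 else (total - real) / total


if __name__ == "__main__":
    batch = [[1, 2, 3], [4], [5, 6]]
    print(pad_to_2d(batch))
    print(flatten_to_1d(batch))
    print(padding_overhead(batch))
```

The padded layout keeps a regular rectangular shape, which is the kind of fixed shape that CUDA graph capture wants, but it spends compute on padding tokens. The flattened layout avoids that waste at the cost of variable shapes. Which one wins for us would need measuring.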