[Tracking] PT model support follow up

https://github.com/octoml/mlc-llm/pull/207 is only a first cut. Many TODO items remain:

- [ ] Fix memory profiling https://github.com/octoml/mlc-llm/pull/207#issuecomment-1955073134
- [ ] Make single-gpu performance at parity with the MLC model
- [ ] Make multi-gpu performance sane
- [ ] Consider using CUDA graphs if we decide to keep the 2D padded input representation
- [ ] Or, consider reverting the 2D input change
- [ ] Revisit the custom changes in our vLLM fork https://github.com/octoml/vllm/tree/for-mlc-serve and minimize them
- [ ] Figure out how to support models beyond the ones in vLLM
- [ ] Support parallel-sampling eviction by recompute (requires model change)
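
For context on the two input-layout items above, here is a minimal sketch of the trade-off in plain Python. It is not code from the PR: the helper names (`pad_2d`, `flatten_1d`, `padding_fraction`) are hypothetical, and it assumes the alternative to the 2D padded layout is a flattened 1D token layout with per-sequence offsets.

```python
from typing import List, Tuple


def pad_2d(seqs: List[List[int]], pad_id: int = 0) -> Tuple[List[List[int]], List[int]]:
    """Pad every sequence to the longest one; return the rows and the true lengths."""
    max_len = max(len(s) for s in seqs)
    rows = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    return rows, [len(s) for s in seqs]


def flatten_1d(seqs: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Concatenate all sequences; return the tokens and cumulative start offsets."""
    tokens: List[int] = []
    offsets = [0]
    for s in seqs:
        tokens.extend(s)
        offsets.append(len(tokens))
    return tokens, offsets


def padding_fraction(seqs: List[List[int]]) -> float:
    """Fraction of slots in the padded 2D layout that hold padding, not real tokens."""
    max_len = max(len(s) for s in seqs)
    total = max_len * len(seqs)
    real = sum(len(s) for s in seqs)
    return (total - real) / total


if __name__ == "__main__":
    batch = [[1, 2, 3], [4], [5, 6]]
    print(pad_2d(batch))
    print(flatten_1d(batch))
    print(padding_fraction(batch))
```

The padded layout has a shape that depends only on the batch size and the maximum length, which suits CUDA graph replay (graphs are captured for fixed shapes), at the cost of compute spent on padding slots. The flattened layout carries no padding, but its total length changes from batch to batch.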

[Tracking] PT model support follow up #217