[WIP] [V1] TPU support #11936
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
This pull request has merge conflicts that must be resolved before it can be merged.
def __init__(
    self,
    vllm_config: VllmConfig,
    device: torch.device,
):
The function implementation is almost identical to `gpu_model_runner.py`. It would be better to build a `ModelRunnerBase` class and derive from it instead of duplicating the code.
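A minimal sketch of the refactor the reviewer suggests: hoist the shared constructor logic into a base class and keep only device-specific behavior in the subclasses. The names `ModelRunnerBase` and `TPUModelRunner` match the discussion, but the method bodies here are illustrative assumptions, not the actual vLLM implementation.

```python
# Illustrative sketch only: shared logic moves into a base class so the
# GPU and TPU runners stop duplicating it. Field and method contents are
# assumptions for demonstration.
from abc import ABC, abstractmethod


class ModelRunnerBase(ABC):
    """Holds logic common to all device runners (previously duplicated)."""

    def __init__(self, vllm_config, device):
        self.vllm_config = vllm_config
        self.device = device

    @abstractmethod
    def execute_model(self, scheduler_output):
        """Device-specific execution stays in each subclass."""


class TPUModelRunner(ModelRunnerBase):
    def execute_model(self, scheduler_output):
        # Placeholder body: a real runner would dispatch to the TPU here.
        return f"ran on {self.device}: {scheduler_output}"


runner = TPUModelRunner(vllm_config=None, device="tpu:0")
print(runner.execute_model("batch-0"))
```

With this shape, `gpu_model_runner.py` and the new TPU runner would each only override the parts that genuinely differ per device.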
return PrefillInputData(
    request_ids=prefill_request_ids,
    prompt_lens=prefill_prompt_lens,
    token_ids=prefill_token_ids,
    position_ids=prefill_position_ids,
    attn_metadata=prefill_attn_metadata,
)
Could we remove the `PrefillInputData` data structure and make this consistent with `gpu_model_runner`?
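To illustrate the suggestion: a wrapper dataclass like `PrefillInputData` can be flattened into the plain fields the GPU runner consumes directly. The field names below are assumptions taken from the snippet above, not the full actual definition.

```python
# Illustrative only: a TPU-specific container (left) vs. unpacking it into
# the flat fields a shared _prepare_inputs path could return. Field names
# are assumed from the PR snippet.
from dataclasses import astuple, dataclass
from typing import List


@dataclass
class PrefillInputData:
    request_ids: List[str]
    prompt_lens: List[int]
    token_ids: List[List[int]]


data = PrefillInputData(
    request_ids=["req-0"],
    prompt_lens=[3],
    token_ids=[[1, 2, 3]],
)
# Dropping the wrapper means both runners can share one return convention:
request_ids, prompt_lens, token_ids = astuple(data)
```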
def _prepare_prefill_inputs(
    self,
    num_scheduled_tokens: List[int],
) -> PrefillInputData:
Do we still need `_prepare_prefill_inputs` in V1? (I'm assuming this can already be handled by `_prepare_inputs`.)
We need to run separate prefill and decode passes for TPU since we don't have the attention kernel support yet. That support is on the way, so we hope to remove this soon.
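A hedged sketch of the split described here: without a unified attention kernel, scheduled requests are partitioned into a prefill batch and a decode batch and run in two passes. The function name `split_requests` and the `num_computed_tokens` field are illustrative assumptions, though vLLM's scheduler does track computed-token counts per request.

```python
# Illustrative sketch: partition scheduled requests into prefill vs. decode.
# A request with no computed tokens still needs its prompt prefilled; one
# with cached prompt tokens only needs a single-token decode step.
def split_requests(requests):
    prefills = [r for r in requests if r["num_computed_tokens"] == 0]
    decodes = [r for r in requests if r["num_computed_tokens"] > 0]
    return prefills, decodes


requests = [
    {"id": "a", "num_computed_tokens": 0},   # fresh prompt -> prefill pass
    {"id": "b", "num_computed_tokens": 17},  # mid-generation -> decode pass
]
prefills, decodes = split_requests(requests)
```

Once a TPU attention kernel can handle mixed batches, both groups could go through the single `_prepare_inputs` path, as the reviewer suggests.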
    attn_metadata=prefill_attn_metadata,
)

def _prepare_decode_inputs(self) -> DecodeInputData:
Again, do we really need `_prepare_decode_inputs` in the V1 architecture? (I'm assuming this can already be handled by `_prepare_inputs`.)
    effective_query_lens=None,
))

def _prepare_inputs(self, scheduler_output: "SchedulerOutput"):
This is almost identical to the current `gpu_model_runner` implementation; consider reusing it instead of duplicating?
This PR is a rebase and modification of @robertgshaw2-neuralmagic's original PR for TPU support from 1.5 months ago: #10241
TODOs: