
[WIP] [V1] TPU support #11936

Open · wants to merge 18 commits into main

Conversation

alexm-neuralmagic (Collaborator) commented Jan 10, 2025

This PR is a rebase and modification of @robertgshaw2-neuralmagic's original TPU-support PR from 1.5 months ago (#10241).

TODOs:

  1. Verify correctness.
  2. Refactor the code to remove duplication.
  3. Tweak attention parameters for best performance.


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀


mergify bot commented Jan 10, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @alexm-neuralmagic.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Jan 10, 2025
Alexander Matveev added 2 commits January 10, 2025 16:10
Comment on lines +81 to +85
def __init__(
    self,
    vllm_config: VllmConfig,
    device: torch.device,
):
Contributor

The function implementation is almost identical to gpu_model_runner.py; it would be better to build a ModelRunnerBase class and derive from it instead of duplicating the code.
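
A hedged sketch of the suggested refactor, assuming a hypothetical ModelRunnerBase; class and attribute names here are illustrative, not the actual vLLM API:

import torch


class ModelRunnerBase:
    """Setup shared by device-specific runners (illustrative)."""

    def __init__(self, vllm_config: "VllmConfig", device: torch.device):
        self.vllm_config = vllm_config
        self.device = device
        # ...initialization common to GPU and TPU runners goes here...


class TPUModelRunner(ModelRunnerBase):
    """Only TPU-specific initialization remains in the subclass."""

    def __init__(self, vllm_config: "VllmConfig", device: torch.device):
        super().__init__(vllm_config, device)
        # ...TPU-only setup (e.g., XLA-related state)...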

Comment on lines +382 to +388
return PrefillInputData(
    request_ids=prefill_request_ids,
    prompt_lens=prefill_prompt_lens,
    token_ids=prefill_token_ids,
    position_ids=prefill_position_ids,
    attn_metadata=prefill_attn_metadata,
)
Contributor

Remove the PrefillInputData data structure and make it consistent with gpu_model_runner?
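
For context, a minimal sketch of what a container like PrefillInputData might hold, inferred from the constructor call quoted above; the field types are assumptions, not the PR's actual definitions:

from dataclasses import dataclass
from typing import Any, List

import torch


@dataclass
class PrefillInputData:
    # One entry per prefill request in the batch (types assumed).
    request_ids: List[str]
    prompt_lens: List[int]
    token_ids: List[torch.Tensor]
    position_ids: List[torch.Tensor]
    attn_metadata: List[Any]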

Comment on lines +318 to +321
def _prepare_prefill_inputs(
    self,
    num_scheduled_tokens: List[int],
) -> PrefillInputData:
Contributor

Do we still need _prepare_prefill_inputs in V1?
(I'm assuming this can already be handled by _prepare_inputs.)

Member

We need to run separate prefill and decode for TPU since we don't have the attention kernel support yet. That support is on the way, so we hope to remove this split soon.
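
For illustration only, a minimal sketch of the split execution flow described above, under the assumption that prefills run one at a time and decodes run as one batched step; the run_prefill and run_decode callables are hypothetical, not the PR's actual methods:

from typing import Any, Callable, List, Optional


def execute_split(prefill_inputs: List[Any],
                  decode_inputs: Optional[Any],
                  run_prefill: Callable[[Any], Any],
                  run_decode: Callable[[Any], Any]) -> List[Any]:
    # Without a unified attention kernel on TPU, each prefill request is
    # executed on its own (with padded, pre-compiled shapes)...
    outputs = [run_prefill(inp) for inp in prefill_inputs]
    # ...while all decode requests run together as a single batched step.
    if decode_inputs is not None:
        outputs.append(run_decode(decode_inputs))
    return outputs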

    attn_metadata=prefill_attn_metadata,
)

def _prepare_decode_inputs(self) -> DecodeInputData:
Contributor

Again, do we really need _prepare_decode_inputs in the V1 architecture?
(I'm assuming this can already be handled by _prepare_inputs.)

    effective_query_lens=None,
))

def _prepare_inputs(self, scheduler_output: "SchedulerOutput"):
Contributor

This is almost identical to the current gpu_model_runner implementation; consider reusing it instead of duplicating?

Alexander Matveev added 2 commits January 11, 2025 03:14