
[Core] Support data parallel for vllm-v0 engine #339

Closed
wants to merge 5 commits into from

Conversation


@qsunnyy qsunnyy commented Mar 15, 2025

Set up data parallel communication, compatible with TP & EP.
Example command to run with data parallel: python examples/offline_inference_data_parallel.py
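
For reference, here is a minimal sketch of what such a data-parallel offline inference driver could look like. It is an illustration only: the environment variable names (VLLM_DP_RANK, VLLM_DP_SIZE), the model, and the prompt sharding are assumptions, not necessarily what this PR's example script does.

# Hypothetical sketch: one process per DP rank, each owning its own engine.
# VLLM_DP_RANK / VLLM_DP_SIZE are assumed names, used here only for illustration.
import os
from multiprocessing import Process

from vllm import LLM, SamplingParams


def dp_worker(dp_rank, dp_size, prompts):
    # Tell this process which data-parallel rank it is.
    os.environ["VLLM_DP_RANK"] = str(dp_rank)
    os.environ["VLLM_DP_SIZE"] = str(dp_size)

    # Each DP rank builds its own LLM; TP/EP can still be used inside the rank.
    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)
    sampling = SamplingParams(temperature=0.8, max_tokens=32)

    # Shard the prompts round-robin across DP ranks.
    for output in llm.generate(prompts[dp_rank::dp_size], sampling):
        print(f"[DP rank {dp_rank}] {output.outputs[0].text!r}")


if __name__ == "__main__":
    dp_size = 2
    prompts = ["Hello, my name is", "The capital of France is",
               "The future of AI is", "1 + 1 ="]
    procs = [Process(target=dp_worker, args=(rank, dp_size, prompts))
             for rank in range(dp_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()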

@qsunnyy qsunnyy force-pushed the main branch 2 times, most recently from 2ca0c67 to ad1bc10 Compare March 17, 2025 02:04
qsunnyy and others added 5 commits March 17, 2025 10:49
Signed-off-by: qsunnyy <qybottle@163.com>
Signed-off-by: q00832892 <qiaoyang19@huawei.com>
Signed-off-by: qsunnyy <qybottle@163.com>
Signed-off-by: q00832892 <qiaoyang19@huawei.com>
Signed-off-by: q00832892 <qiaoyang19@huawei.com>
Signed-off-by: q00832892 <qiaoyang19@huawei.com>
Signed-off-by: q00832892 <qiaoyang19@huawei.com>
from vllm.distributed.parallel_state import (
    destroy_distributed_environment, destroy_model_parallel)

import vllm_ascend # noqa
Collaborator

I think there is no need to import vllm_ascend here.

from . import config # noqa
from . import forward_context # noqa
from .distributed import utils # noqa
from .engine import llm_engine # noqa
Collaborator

These patches seem unrelated to model registration. If we have to do this, let's put the patch in vllm_ascend/patch.
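
As an illustration of that suggestion, the patch could live in its own module under vllm_ascend/patch and be applied explicitly. The file name, function names, and the scheduler check below are hypothetical placeholders, not this PR's actual code.

# vllm_ascend/patch/patch_llm_engine.py  (hypothetical module)
from vllm.engine.llm_engine import LLMEngine


def _dp_aware_has_unfinished_requests(self) -> bool:
    # Placeholder logic: a DP-aware check would also coordinate with peer
    # DP ranks so that every rank keeps stepping in lock-step.
    return any(s.has_unfinished_seqs() for s in self.scheduler)


def apply_llm_engine_patch() -> None:
    # Keep all monkey-patching in one auditable place and apply it
    # explicitly, instead of as a side effect of unrelated imports.
    LLMEngine.has_unfinished_requests = _dp_aware_has_unfinished_requests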

backend_class._set_sequence_number_for_group()
pg._register_backend(device, backend_type, backend_class)

elif backend == "nccl":
Collaborator

Let's remove the CUDA-related code here.

LLMEngine.__init__ = new_init
LLMEngine.has_unfinished_requests = new_has_unfinished_requests
LLMEngine.has_unfinished_requests_dp = new_has_unfinished_requests_dp
print("[Success] Custom LLMEngine patch applied!")
Collaborator

Why do we need to patch the engine? I don't think patching the engine is a good idea.
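
A patch-free alternative, sketched purely for illustration (the class and method below are hypothetical), would be to subclass the engine and override the behaviour there instead of rebinding methods on LLMEngine itself:

# Hypothetical subclass; not code from this PR.
from vllm.engine.llm_engine import LLMEngine


class DPLLMEngine(LLMEngine):
    """Engine variant that adds data-parallel aware bookkeeping."""

    def has_unfinished_requests_dp(self, has_unfinished: bool) -> bool:
        # Placeholder: a DP-aware engine would aggregate the "has unfinished
        # requests" flag across all DP ranks so they stay in lock-step.
        return has_unfinished or self.has_unfinished_requests()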

dtype=torch.int32)
from vllm.distributed.parallel_state import get_dp_group
dist.all_reduce(num_tokens_tensor, group=get_dp_group().device_group)
cu_tokens_across_dp_npu = torch.cumsum(num_tokens_tensor, dim=0)
Collaborator

Why can the GPU path use a cu_tokens_across_dp_cpu tensor while we need an NPU tensor here? Same question for the process group.
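
For comparison, here is a sketch of the CPU-side variant the comment refers to, assuming vLLM's DP GroupCoordinator exposes a gloo-backed cpu_group along with world_size and rank_in_group; num_tokens is a placeholder for this rank's scheduled token count.

# Illustrative only: keep the per-rank token counts on a CPU tensor and
# all-reduce them over the DP group's gloo backend, so no device tensor
# (or extra device sync) is needed to build the cumulative counts.
import torch
import torch.distributed as dist
from vllm.distributed.parallel_state import get_dp_group

dp_group = get_dp_group()
num_tokens = 128  # placeholder for this rank's scheduled token count

num_tokens_tensor = torch.zeros(dp_group.world_size, dtype=torch.int32)
num_tokens_tensor[dp_group.rank_in_group] = num_tokens
dist.all_reduce(num_tokens_tensor, group=dp_group.cpu_group)
cu_tokens_across_dp_cpu = torch.cumsum(num_tokens_tensor, dim=0)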

# we use synchronous scheduling right now,
# adding a sync point here should not affect
# scheduling of the next batch
torch.cuda.synchronize()
Collaborator

Suggested change
torch.cuda.synchronize()
torch.npu.synchronize()
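
As a side note, torch.npu is provided by the torch_npu plugin (Ascend's PyTorch adapter), which vllm_ascend normally imports; a minimal standalone sketch:

# Minimal sketch: torch.npu only exists after torch_npu has been imported.
import torch
import torch_npu  # noqa: F401  # registers the "npu" device and torch.npu API

x = torch.ones(4, device="npu")  # enqueue a small kernel on the NPU
torch.npu.synchronize()          # block until all queued NPU work finishes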

@qsunnyy qsunnyy closed this by deleting the head repository Mar 24, 2025

2 participants