How to use LitServe to serve the vLLM engine #382

Open
GhostXu11 opened this issue Dec 4, 2024 · 2 comments
Labels
bug (Something isn't working) · help wanted (Extra attention is needed)

Comments

@GhostXu11

πŸ› Bug

I want to use the vLLM engine in LitServe with batching enabled. The code below works fine, but when I make parallel requests, the results of both requests are returned in the response to the first request. Do you know how to modify the code to avoid this?

Code sample

import logging
from typing import List

import litserve as ls
from vllm import EngineArgs, LLMEngine, RequestOutput, SamplingParams
from vllm.entrypoints.chat_utils import apply_hf_chat_template, parse_chat_messages
from vllm.inputs import TextPrompt
from vllm.transformers_utils.tokenizer_group import TokenizerGroup
from vllm.utils import random_uuid

logger = logging.getLogger(__name__)


class SimpleLitAPI(ls.LitAPI):
    def setup(self, device):
        config = self._config  # vLLM engine CLI args, assumed to be set on the instance elsewhere
        engine_args = EngineArgs.from_cli_args(config)
        self.llm_engine = LLMEngine.from_engine_args(engine_args)
        self.tokenizer = self.llm_engine.get_tokenizer_group(TokenizerGroup).tokenizer
        self.model_config = self.llm_engine.get_model_config()

    def batch(self, inputs):
        return inputs

    def predict(self, prompt, context):
        conversations = []
        for item in prompt:
            conversation, mm_data = parse_chat_messages(item, self.model_config,
                                                        self.tokenizer)
            conversations.append(conversation)
        prompts = []
        logger.info(f'type_tokenizer: {type(self.tokenizer)}')
        for conversation in conversations:
            prompt = apply_hf_chat_template(
                self.tokenizer,
                conversation,
                chat_template=None,
            )
            prompts.append(prompt)
        inputs = []
        for p in prompts:
            inputs.append(TextPrompt(prompt=p))
        contexts = context
        context_list = []
        for context in contexts:
            # set params
            temperature = float(context.get("temperature", 1.0))
            top_p = float(context.get("top_p", 1.0))
            max_new_tokens = context.get("max_new_tokens", 256)
            top_k = context.get("top_k", -1)
            presence_penalty = float(context.get("presence_penalty", 0.0))
            frequency_penalty = float(context.get("frequency_penalty", 0.0))
            best_of = context.get("best_of", None)

            # make sampling params in vllm
            top_p = max(top_p, 1e-5)
            if temperature <= 1e-5:
                top_p = 1.0

            sampling_params = SamplingParams(
                n=1,
                temperature=temperature,
                top_p=top_p,
                max_tokens=max_new_tokens,
                top_k=top_k,
                presence_penalty=presence_penalty,
                frequency_penalty=frequency_penalty,
                best_of=best_of,
            )
            context_list.append(sampling_params)
        result = list(zip(inputs, context_list))

        while result or self.llm_engine.has_unfinished_requests():
            if result:
                prompt, sampling_params = result.pop(0)
                request_id = random_uuid()
                self.llm_engine.add_request(str(request_id), prompt, sampling_params)
            previous_output = ""
            while self.llm_engine.has_unfinished_requests():
                request_outputs: List[RequestOutput] = self.llm_engine.step()
                for request_output in request_outputs:
                    current_output = request_output.outputs[0].text
                    new_output = current_output[len(previous_output):]  # extract only the newly generated text
                    previous_output = current_output  # remember what has already been emitted

                    if new_output:  # only yield when new text was generated
                        yield [{'role': 'assistant', 'content': new_output}]
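
Batching in LitServe is enabled when the server is constructed, not inside the LitAPI; for context, here is a minimal launch sketch for the API above (batch size, timeout, and port are illustrative values, not from the issue):

if __name__ == "__main__":
    api = SimpleLitAPI()
    # max_batch_size > 1 turns on batching; batch_timeout is how long LitServe
    # waits to fill a batch before calling predict; stream=True because predict
    # is a generator that yields incremental output
    server = ls.LitServer(api, max_batch_size=8, batch_timeout=0.05, stream=True)
    server.run(port=8000)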
GhostXu11 added the bug and help wanted labels on Dec 4, 2024
@GhostXu11 (Author)

I don't know if my idea is right. Maybe I need to collect all the outputs, combine them into a list, and yield that list so that the batch function works correctly?

@aniketmaurya (Collaborator)

Hi @GhostXu11, the yield from predict should be a list whose length equals the input batch size, where each item contains the generated token(s) for the corresponding request.
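
A minimal sketch of what that loop could look like, continuing from result = list(zip(inputs, context_list)) in the code sample above: all requests are added to the engine before stepping, and each yield is a list with one entry per request in the original batch order, so LitServe's unbatching can route each delta back to the right client. The per-request delta tracking here is an illustrative assumption, not an official LitServe or vLLM recipe.

        # continuing from `result = list(zip(inputs, context_list))` above
        request_ids = []
        for text_prompt, sampling_params in result:
            request_id = str(random_uuid())
            request_ids.append(request_id)
            self.llm_engine.add_request(request_id, text_prompt, sampling_params)

        previous = {rid: "" for rid in request_ids}  # text already emitted per request
        while self.llm_engine.has_unfinished_requests():
            step_outputs: List[RequestOutput] = self.llm_engine.step()
            latest = dict(previous)
            for request_output in step_outputs:
                latest[request_output.request_id] = request_output.outputs[0].text
            # one item per request, in the original batch order
            yield [
                {"role": "assistant", "content": latest[rid][len(previous[rid]):]}
                for rid in request_ids
            ]
            previous = latest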
