✨ Support prompt logprobs with static batching #274
Conversation
monkeypatch.setenv("VLLM_USE_V1", 1) | ||
monkeypatch.setenv("VLLM_SPYRE_DYNAMO_BACKEND", backend) | ||
monkeypatch.setenv("VLLM_SPYRE_ENABLE_PROMPT_LOGPROBS", 1) | ||
llm = LLM(model) |
For all the other e2e tests we specify the following args explicitly:
vllm_model = LLM(
    model=model,
    tokenizer=model,
    max_model_len=max_model_len,
    max_num_seqs=max_num_seqs,
    block_size=block_size,
    tensor_parallel_size=tensor_parallel_size,
)
It might be nice to do this here too, to be a) consistent with the other tests, and b) safe if default values should change...
Yeah, I've seen tests fail in upstream vLLM because the default max_model_len is larger than what the hardware used for the tests supports. So I think it's a good idea to set these parameters to the minimum required for the test to pass. But isn't it an exceptional situation where the tokenizer is different from what vLLM would load by default for a specific model?
For max_model_len, max_num_seqs, and block_size, I'd actually rather not set them here, because right now this test is specifically only for static batching, so they're unused. Once we support continuous batching, though, I agree we can and should set those explicitly.
I could go ahead and parameterize this test for multi-AIU if we want; that would be helpful for it to run when it detects multiple cards.
For tokenizer, I agree we shouldn't ever need to set it differently, and if we did, it wouldn't work to set the same model name anyway. But... there's no harm in setting it for consistency with the other tests, I guess.
tensor parallel works 🎉
agree!
tests/expected_prompt_logprobs.json
Outdated
@@ -0,0 +1,744 @@
{
    "Hello darkness my old friend": [
I love this prompt :)
Replaced with chicken soup prompts :(
But I can add it to the list of stock prompts if you want to keep it
offset = hidden_states.shape[0] - num_prompt_tokens

prompt_hidden_states = hidden_states[offset:offset + num_logits]
logits = self.model.compute_logits(prompt_hidden_states, None)
We already compute the logits in the forward pass on line 407. Would it be possible to reuse that tensor instead of recomputing it?
Something like:
def _get_prompt_logprobs_dict(
    self,
    logits: torch.Tensor,
    model_inputs: ModelForwardInputs,
) -> dict[str, Optional[LogprobsTensors]]:
    ...
    for req_id in ...:  # loop over requests that asked for prompt logprobs
        ...
        # Get the logits corresponding to this req's prompt tokens.
        req_idx = self.get_req_id_to_index(model_inputs.is_prompt)[req_id]
        req_logits = logits[req_idx]
        # The offset needs to account for the left padding that static
        # batching applies.
        # TODO: To support continuous batching the offset needs to be
        # calculated differently.
        offset = req_logits.shape[0] - num_prompt_tokens
        req_logits = req_logits[offset:offset + num_logits]
        ...
Yup, this works!
def get_num_prompt_logprobs(self, is_prefill: bool) -> dict[str, int]:
    return (self.prefill_batch.num_prompt_logprobs
            if is_prefill else self.input_batch.num_prompt_logprobs)
Are the two branches here needed? Do we ever return self.input_batch.num_prompt_logprobs (for decode)? Asked differently, are we not exiting at line 295 if self.no_prompt_logprob(model_inputs.is_prompt) is True for decodes?
True true, this is very likely to only ever be called after a no_prompt_logprob guard, so it's probably fine to simplify.
(This also doesn't work on CB now anyway.)
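For illustration, the simplified accessor might look something like this (a sketch only, assuming the method is always reached behind the no_prompt_logprob guard so the decode branch never runs; this is not necessarily the code that was merged):

def get_num_prompt_logprobs(self) -> dict[str, int]:
    # Prompt logprobs are only produced during prefill, and callers are
    # expected to bail out earlier via no_prompt_logprob() for decodes,
    # so the self.input_batch branch is not needed.
    return self.prefill_batch.num_prompt_logprobs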
model = "ibm-ai-platform/micro-g3.3-8b-instruct-1b" | ||
num_prompt_logprobs = 5 | ||
|
||
json_path = Path(__file__).parent.parent / "expected_prompt_logprobs.json" |
Is there a way to get these prompt logprobs from the Hugging Face model instead of reading hard-coded values here?
If it's a small model, we could perhaps execute it on CPU with transformers as a reference implementation.
Yeah, I just didn't know how to implement that without reimplementing most of the code here again. I checked the upstream vLLM tests for prompt logprobs, and they run lm_eval with a test suite that uses prompt logprobs, which wouldn't be feasible for us to do. I figured that testing against known-good results from vLLM was a simple enough solution.
I'm down to pair on a solution; maybe I'll see if granite-3.3-8b can generate code to get prompt logprobs out of a transformers model.
Replaced with an HF implementation, which I hope is correct 😬
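For reference, a minimal sketch of how per-token prompt logprobs can be computed with transformers on CPU (the helper name, return structure, and top-k handling here are illustrative assumptions, not the code merged in this PR):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def hf_prompt_logprobs(model_name: str, prompt: str, top_k: int = 5):
    """Reference prompt logprobs from an HF causal LM on CPU."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        # Logits at position i are the distribution for token i + 1,
        # so drop the last position and align with input_ids[1:].
        logits = model(input_ids).logits[0, :-1]
    logprobs = torch.log_softmax(logits.float(), dim=-1)

    results = []
    for pos, token_id in enumerate(input_ids[0, 1:].tolist()):
        top = torch.topk(logprobs[pos], top_k)
        results.append({
            "token_id": token_id,
            "logprob": logprobs[pos, token_id].item(),
            "top_k": dict(zip(top.indices.tolist(), top.values.tolist())),
        })
    return results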
Nice, this is looking good. I mostly agree with Yannick's comments.
@yannicks1 @maxdebayser This should be ready for another look. I can follow up with enabling this for CB; that will require touching bits of the model that I think Yannick is currently working on to replace.
LGTM
@@ -1,4 +1,4 @@
-"""Verification of vLLM output by comparing with HF
+"""Tests validating the correctness and configuration of prompt_logprobs.
Did you replace the wrong comment here?
Hah, yeah, I clicked on the wrong file, sorry. I'll fix it.
monkeypatch.setenv("VLLM_USE_V1", 1) | ||
monkeypatch.setenv("VLLM_SPYRE_DYNAMO_BACKEND", backend) | ||
monkeypatch.setenv("VLLM_SPYRE_ENABLE_PROMPT_LOGPROBS", 1) | ||
llm = LLM(model) |
agree!
I don't think logprobs for CB is a priority before CB is fully working. Currently we have
Description
This PR enables prompt logprobs with static batching, at batch size 1 only. This enables some experimentation and model-evaluation tasks on Spyre hardware.
For static batching, this requires us to warm the model up with only_last_token=False, which passes back the hidden state tensors for the entire (padded) prompt. This is a big performance penalty, and it is also only supported on Spyre cards with batch size 1 currently.
So, this PR introduces an environment flag, VLLM_SPYRE_ENABLE_PROMPT_LOGPROBS, which must be set to 1 to enable prompt logprobs. At bootup, we check that we're running in static batching mode with max batch size 1, and fail to boot otherwise. All requests that ask for prompt_logprobs will be rejected unless prompt logprobs are enabled. This is different from the behavior today, where requests always return [None] for prompt logprobs.
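A usage sketch under those constraints (the prompt and logprob counts are illustrative; the model is the one used in the tests above, and the flag would normally be exported in the shell before launching rather than set inline):

import os
from vllm import LLM, SamplingParams

# Prompt logprobs must be opted into explicitly, and only work with
# static batching at max batch size 1.
os.environ["VLLM_SPYRE_ENABLE_PROMPT_LOGPROBS"] = "1"

llm = LLM(model="ibm-ai-platform/micro-g3.3-8b-instruct-1b")
params = SamplingParams(max_tokens=5, prompt_logprobs=5)
outputs = llm.generate(["Hello darkness my old friend"], params)
# Each RequestOutput carries per-token prompt logprobs when enabled.
print(outputs[0].prompt_logprobs)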