doc/source/data/working-with-llms.rst (45 additions, 21 deletions)
@@ -205,6 +205,51 @@ You can also make calls to deployed models that have an OpenAI compatible API en
:start-after: __openai_example_start__
:end-before: __openai_example_end__
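The included example follows this general shape. The condensed sketch below assumes the documented ``HttpRequestProcessorConfig`` pattern; the endpoint URL, model name, and response field layout are illustrative rather than taken from the included snippet.

.. code-block:: python

    import os

    import ray
    from ray.data.llm import HttpRequestProcessorConfig, build_llm_processor

    # Target an OpenAI-compatible chat completions endpoint.
    config = HttpRequestProcessorConfig(
        url="https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        qps=1,
    )

    processor = build_llm_processor(
        config,
        # Build the request payload for each input row.
        preprocess=lambda row: dict(
            payload=dict(
                model="gpt-4o-mini",  # placeholder: any model behind the endpoint
                messages=[{"role": "user", "content": row["prompt"]}],
            ),
        ),
        # Pull the generated text out of the HTTP response.
        postprocess=lambda row: dict(
            response=row["http_response"]["choices"][0]["message"]["content"],
        ),
    )

    ds = ray.data.from_items([{"prompt": "Write a haiku about Ray."}])
    print(processor(ds).take_all())
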
Batch inference with serve deployments
---------------------------------------

You can configure any :ref:`serve deployment <converting-to-ray-serve-application>` for batch inference. This is particularly useful for multi-turn conversations,
where multiple conversations can share a single vLLM engine. To achieve this, create an :ref:`LLM serve deployment <serving-llms>` and use
the :class:`ServeDeploymentProcessorConfig <ray.data.llm.ServeDeploymentProcessorConfig>` class to configure the processor, as the sketch below shows.
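The following is a minimal sketch of this pattern. It assumes an LLM app deployed with Ray Serve, and the exact ``ServeDeploymentProcessorConfig`` fields and request payload shape shown here (``deployment_name``, ``app_name``, ``method``, ``dtype``, ``request_kwargs``) are illustrative assumptions; consult the :class:`ServeDeploymentProcessorConfig <ray.data.llm.ServeDeploymentProcessorConfig>` API reference for the released signature.

.. code-block:: python

    import ray
    from ray import serve
    from ray.serve.llm import LLMConfig, build_llm_deployment
    from ray.data.llm import ServeDeploymentProcessorConfig, build_llm_processor

    # Deploy a shared vLLM engine behind a Serve application.
    llm_config = LLMConfig(
        model_loading_config=dict(
            model_id="qwen-0.5b",
            model_source="Qwen/Qwen2.5-0.5B-Instruct",
        ),
    )
    serve.run(build_llm_deployment(llm_config), name="llm_app")

    # NOTE: the field names below are assumptions for illustration only;
    # check the ServeDeploymentProcessorConfig API reference for the exact schema.
    config = ServeDeploymentProcessorConfig(
        deployment_name="LLMDeployment:qwen-0_5b",  # hypothetical deployment name
        app_name="llm_app",
        batch_size=16,
        concurrency=1,
    )

    processor = build_llm_processor(
        config,
        # Each row becomes a chat request routed to the shared deployment.
        preprocess=lambda row: dict(
            method="chat",                  # assumed request fields
            dtype="ChatCompletionRequest",
            request_kwargs=dict(
                model="qwen-0.5b",
                messages=[{"role": "user", "content": row["prompt"]}],
            ),
        ),
        # Pass the deployment's response through unchanged.
        postprocess=lambda row: row,
    )

    ds = ray.data.from_items([{"prompt": "What is the capital of France?"}])
    print(processor(ds).take_all())
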
In addition, you can customize the placement group strategy to control how Ray places vLLM engine workers across nodes.
While you can specify the degree of tensor and pipeline parallelism, the specific assignment of model ranks to GPUs is managed by the vLLM engine and you can't directly configure it through the Ray Data LLM API.
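For example, the sketch below (using the ``vLLMEngineProcessorConfig`` pattern from earlier in this guide, with a placeholder model) requests 2-way tensor parallelism and 2-way pipeline parallelism; vLLM decides which rank runs on which GPU, and the placement group strategy option mentioned above is not shown here.

.. code-block:: python

    from ray.data.llm import vLLMEngineProcessorConfig

    # Each engine replica spans 2 (tensor) x 2 (pipeline) = 4 GPUs.
    # Ray Data LLM sizes the placement group; vLLM assigns ranks to GPUs.
    config = vLLMEngineProcessorConfig(
        model_source="unsloth/Llama-3.1-8B-Instruct",  # placeholder model
        engine_kwargs={
            "tensor_parallel_size": 2,
            "pipeline_parallel_size": 2,
            "max_model_len": 16384,
        },
        concurrency=1,  # one engine replica
        batch_size=64,
    )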