By default, we use the configuration in [ray_configs/ray_config.yaml](./ray_configs/ray_config.yaml). You can also customize the following Ray parameters:
- `tensor_parallel_size`: Tensor parallel size per replica. Defaults to 4.
- `accelerator_type`: GPU accelerator type. See [the list of available types](https://docs.ray.io/en/latest/ray-core/accelerator-types.html) for more information. Defaults to None, which means any available GPUs in the Ray cluster will be used.
- `num_replicas`: Number of model replicas to use for inference. Defaults to 2.
- `batch_size`: Batch size per model replica for inference.
- `gpu_memory_utilization`: Fraction of GPU memory allocated to the model executor in vLLM. Defaults to 0.9.
- `dtype`: Data type used for inference. Defaults to "auto".
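As a rough illustration, a config file setting these parameters might look like the sketch below. The flat key layout and the `batch_size` value are assumptions for illustration; refer to the actual `ray_configs/ray_config.yaml` in this repository for the authoritative structure.

```yaml
# Hypothetical sketch — the repository's ray_config.yaml is authoritative.
tensor_parallel_size: 4       # tensor parallel size per replica
accelerator_type: null        # e.g. "A100"; null uses any GPUs in the Ray cluster
num_replicas: 2               # number of model replicas for inference
batch_size: 64                # per-replica batch size (illustrative value)
gpu_memory_utilization: 0.9   # fraction of GPU memory for vLLM's model executor
dtype: auto                   # data type used for inference
```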