
Conversation

ZacAttack
Contributor

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

khluu and others added 30 commits September 26, 2025 15:25
…ep name (ray-project#56951)

Previously, the custom image build job just listed the first two tests
overall and didn't filter based on the tests it's associated with....

---------

Signed-off-by: kevin <[email protected]>
Signed-off-by: Kevin H. Luu <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ct#56856)

Add v2 multinode persistence release test by doing the following:
* `test_persistence.py` uses v1 or v2 functions depending on
`is_v2_enabled`
* The v2 release tests are `variations` on the existing
`train_multinode_persistence` entry in `release_tests.yaml`

---------

Signed-off-by: Timothy Seah <[email protected]>
dead code; no longer used anywhere.

Signed-off-by: Lonnie Liu <[email protected]>
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

This feature adds the ability to (de)serialize arbitrary PyArrow
extension arrays. This is needed to use Ray in code bases that use
extension arrays.

~The serialization already seemed sufficiently general, but as far as I
can tell, the deserialization can not be done in generality. Hence, this
setup allows registration of custom deserializers for extension types.~

~For serialization, the selector has been changed from `ExtensionType`
to `BaseExtensionType` to accommodate for non-Python ExtensionArrays,
like `pyarrow.FixedShapeTensorArray`.~

~This is at the moment a proof-of-concept. If you like the idea, I
suppose the registration function may need to move to a better place,
and docs need adding.~

The implementation now works without registration on any extension type.
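
As a rough, self-contained sketch of that generic roundtrip (not the actual Ray serialization code; `fixed_shape_tensor` requires a recent PyArrow):

```
import pyarrow as pa

# Build an extension array backed by a fixed-size-list storage array.
tensor_type = pa.fixed_shape_tensor(pa.float32(), [2, 2])
storage = pa.array([[1.0, 2.0, 3.0, 4.0]], type=pa.list_(pa.float32(), 4))
ext_array = pa.ExtensionArray.from_storage(tensor_type, storage)

# "Serialize" by keeping only the storage payload plus the extension type,
# then "deserialize" by wrapping the storage back into the type.
restored = ext_array.type.wrap_array(ext_array.storage)
assert restored.equals(ext_array)
```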

## Related issue number

Closes ray-project#51959

## Checks

- [X] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [X] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Generalizes Arrow array (de)serialization to any
`pyarrow.BaseExtensionType`, removing tensor-specific handling and
adding tests for fixed/variable-shape tensors.
> 
> - **Arrow (De)serialization**:
> - Switch from tensor-specific checks to generic
`pyarrow.BaseExtensionType` handling.
> - Reconstruct extension arrays via `type.wrap_array(storage)`;
serialize via storage payload wrapped with extension metadata.
> - Remove `ray.air.util.tensor_extensions.arrow` dependencies and
special-casing.
> - **Tests**:
> - Add roundtrip tests for `pa.FixedShapeTensorArray` and a custom
variable-shape `ExtensionType`.
>   - Import `PicklableArrayPayload` in tests for constructing payloads.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
4bbcdbe. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Signed-off-by: Pim de Haan <[email protected]>
)

creating requirement files for docker images

---------

Signed-off-by: elliot-barn <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
…ect#56975)

When running in a mode where no Bazel build is required, we return early
and avoid running all Bazel-related logic.
…ject#56658)

To avoid accidentally triggering too many tests from a loose regex
filter, this step warns users before proceeding if the filter returns 5+
tests to run on the release pipeline.

---------

Signed-off-by: kevin <[email protected]>
Signed-off-by: Kevin H. Luu <[email protected]>
This avoids the extra Bazel build calls for building the protobuf files,
which would load the protobuf compilers.


Signed-off-by: Lonnie Liu <[email protected]>
…ay-project#56924)

As mentioned in ray-project#51080, separate the _GcsSubscriber,
GcsErrorSubscriber, and GcsLogSubscriber classes out of the large
_raylet.pyx file.

---------

Signed-off-by: Evelynn-V <[email protected]>
…6915)

https://buildkite.com/ray-project/postmerge-macos/builds/8257#01997de0-1e13-4954-a370-6255f650ca17
Mac C++/Java tests broke after ray-project#56514.

Fixing by reverting to have tag_defs.cc define the vars in a separate
file and not just in the header.

Why this fixes it, I have no idea... for some reason the C++ API
dynamic-linking madness doesn't like the inlined vars on the Mac build
specifically???

---------

Signed-off-by: dayshah <[email protected]>
This change adds Bazel build fixes for Java. Previously the Bazel build
for `all_modules` was failing due to a dependency on a `testonly`
target in Bazel. Split `all_modules` and `all_modules_for_test` into
separate targets so that both can be used wherever required.

Fixes ray-project#56990 

Signed-off-by: Shriraj Bhardwaj <[email protected]>
…ct#57001)

- Reports all metrics in milliseconds instead of variable units, which
makes it easier to understand at a glance and easier to automate
analysis of the stats.
- Converts to using a monotonic clock instead of system time, which
should not be used for measuring intervals.
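
As a rough illustration of the convention (hypothetical timing helper, not the code in this PR):

```
import time

def timed_section():
    # Placeholder for whatever section is being measured.
    time.sleep(0.05)

# Use a monotonic clock (immune to system clock adjustments) and report
# the interval in milliseconds, matching the convention described above.
start = time.monotonic()
timed_section()
elapsed_ms = (time.monotonic() - start) * 1000
print(f"timed_section_ms: {elapsed_ms:.3f}")
```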

---------

Signed-off-by: Edward Oakes <[email protected]>
…oject#56558)

RayEvent provides a special API, merge, which allows multiple events to
be combined into a single event. This reduces gRPC message size, network
bandwidth usage, and is essential for scaling task event exports. This
PR leverages that feature.

Specifically, it clusters events into groups based on (i) entity ID and
(ii) event type. Each group is merged into a single event, which is then
added to the gRPC message body. The EntityId is a user-defined function,
implemented by the event class creator, that determines which events can
be safely merged.
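
A hedged sketch of the grouping idea, with a stand-in event class (hypothetical names; the real RayEvent merge path lives in the event export code):

```
from collections import defaultdict

class FakeEvent:
    """Stand-in for RayEvent with just the pieces this sketch needs."""

    def __init__(self, entity_id, event_type, payload):
        self._entity_id, self.event_type, self.payload = entity_id, event_type, [payload]

    def entity_id(self):
        return self._entity_id

    def merge(self, other):
        self.payload.extend(other.payload)

def merge_events_for_export(events):
    # Cluster events by (entity ID, event type) and merge each cluster
    # into a single event before adding it to the gRPC message body.
    groups = defaultdict(list)
    for event in events:
        groups[(event.entity_id(), event.event_type)].append(event)
    merged = []
    for group in groups.values():
        head, *rest = group
        for other in rest:
            head.merge(other)
        merged.append(head)
    return merged

events = [FakeEvent("task_1", "TASK_STATUS", i) for i in range(3)]
assert len(merge_events_for_export(events)) == 1
```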

Test:
- CI

---------

Signed-off-by: Cuong Nguyen <[email protected]>
Cleaning up some log messages and standardizing on `WithField`.

---------

Signed-off-by: Edward Oakes <[email protected]>
python dependencies are already installed in the CI images and do not need to be reinstalled.

Signed-off-by: Lonnie Liu <[email protected]>
…train.report (ray-project#56360)

The main changes here are:
* Train workers report validation function + validation config
* Controller kicks off validation Ray task and associates its return
value with the relevant checkpoint.
* Main controller step polls workers and validations, only finishing
when both are done.

---------

Signed-off-by: Timothy Seah <[email protected]>
)

This will stop running RLlib flaky tests on every single commit.

Signed-off-by: Lonnie Liu <[email protected]>
Updating the test-filtering regex from fullmatch to match.

The regex match for the following entry needs to include the suffix
variations `name:entity_recognition_with_llms.aws` and
`name:entity_recognition_with_llms.gce` (see the sketch after the YAML excerpt below):
```
- name: entity_recognition_with_llms  # do not use dashes (regex sensitive)
  frequency: weekly
  python: "3.11"
  group: ray-examples
  team: ml
  working_dir: //doc/source/ray-overview/examples/entity-recognition-with-llms  # use // to access from repo's root

  cluster:
    byod:
      type: llm-cu128  # anyscale/ray-llm:<PR_RAY_VERSION>-py311-cu128
      post_build_script: byod_llm_ner.sh  # release/ray_release/byod/
    cluster_compute: ci/aws.yaml  # relative to working_dir

  run:
    timeout: 3600
    script: bash ci/tests.sh  # relative to working_dir

  variations:
    - __suffix__: aws  # uses default specs above
    - __suffix__: gce
      env: gce
      frequency: manual
      cluster:
        cluster_compute: ci/gce.yaml  # relative to working_dir
```
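
A quick standalone illustration of why `fullmatch` drops the suffixed variations while `match` keeps them:

```
import re

names = ["entity_recognition_with_llms.aws", "entity_recognition_with_llms.gce"]
pattern = "entity_recognition_with_llms"

# fullmatch requires the entire name to match, so the suffixed variations
# are filtered out; match only anchors at the start and keeps them.
print([n for n in names if re.fullmatch(pattern, n)])  # []
print([n for n in names if re.match(pattern, n)])      # both names
```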

---------

Signed-off-by: elliot-barn <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
Signed-off-by: zac <[email protected]>
and add the `needs_java` tag to the test. This will allow us to run "normal"
tests without the Java JDK and JRE, and run Java-related tests
separately.

Signed-off-by: Lonnie Liu <[email protected]>
…roject#57029)

The test was broken by a logical merge conflict between
ray-project#56558 and a few other PRs. This
PR fixes a few issues:
- `DriverExecutionEvent` was replaced by `DriverLifecycleEvent`, so we
fix it here
- Sort the list of test events in a deterministic order for testing
(since the merge function might re-arrange them)

Signed-off-by: Cuong Nguyen <[email protected]>
The release test (aggregate_groups_fixed_size_sort_shuffle_pull_based_column02 column14) for pull-based sort shuffle was OOMing. To address this, I reduced the number of blocks to 100 in the read layer.
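
As one hedged illustration of capping the read-layer block count in Ray Data (the actual test change may configure this differently; the dataset path is a placeholder):

```
import ray

# Force the read stage to produce ~100 blocks instead of the default,
# lowering per-task memory pressure for the downstream sort shuffle.
ds = ray.data.read_parquet("s3://example-bucket/dataset/", override_num_blocks=100)
```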

Related issue: DATA-1399

---------

Signed-off-by: Goutam V. <[email protected]>
…project#55193)

### Proposal for status based on discussions

Ideally we want a way to use a common set of statuses across the
codebase, but still limit function return types to just a specific
subset of statuses. The way to do this is by moving away from a status
enum class with an enum per error and towards a status namespace with a
class per error. Languages like Rust offer pattern matching through
enums, but C++ actually offers pattern matching through types. So taking
this one step further: since other languages do errors through enums,
C++ should do errors through classes.

So here, each error gets its own class and you can do pattern matching
on the variant<PossibleErrorClasses> with std::visit. Having the variant
gives us a native way to type check and unwrap all the types. We still
want a way to ergonomically handle the ok case separately from others,
so I'm boxing the variant inside an optional and wrapping the whole
thing up in a templated class to offer better names.

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Introduce variant-based typed status results (`StatusSet`,
`StatusSetOr`) with tag error classes and tests, alongside existing
legacy Status.
> 
> - **Core**:
> - **Typed status API**: Add `StatusT::*` tag classes, `StatusSet<...>`
and `StatusSetOr<T, ...>` wrappers, and `overloaded` helper for
`std::visit` in `src/ray/common/status.h` (with new
`<optional>`/`<variant>` includes). Kept legacy `Status` API alongside.
> - **Tests**:
> - Add unit tests for `StatusSet` and `StatusSetOr`; refactor existing
`StatusTest` cases from `TEST_F` to `TEST` and remove the unused
fixture.
> - **Misc**:
>   - Trivial whitespace change in `cgroup_driver_interface.h`.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
b904b5f. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Signed-off-by: dayshah <[email protected]>
…rManager [1/2] (ray-project#56930)

## Why are these changes needed?

This is the first of two PRs. The end goal is to be able to track
in-flight requests per tag when applying more functions to env runners.
Today, we only track in-flight requests for sample() calls. Tomorrow, we
want to track in-flight requests for other calls separately, like
updating weights.

We want to track them separately because otherwise tasks can block each
other by sending too many requests and clogging the shared pool of
in-flight requests.

The next step after this PR will be a follow-up PR that introduces tags
inside `sync_env_runner_states` to limit the number of weight updates
that are in flight at any given time.
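
A minimal sketch of the per-tag accounting idea (hypothetical class and method names, not the actual ActorManager API):

```
from collections import defaultdict

class InFlightTracker:
    """Toy per-tag accounting: each call kind gets its own in-flight budget."""

    def __init__(self, max_in_flight_per_tag):
        self.max_in_flight_per_tag = max_in_flight_per_tag
        self._in_flight = defaultdict(int)

    def try_start(self, tag):
        if self._in_flight[tag] >= self.max_in_flight_per_tag:
            return False  # limit is enforced per tag, not globally
        self._in_flight[tag] += 1
        return True

    def finish(self, tag):
        self._in_flight[tag] -= 1

tracker = InFlightTracker(max_in_flight_per_tag=2)
assert tracker.try_start("sample")
assert tracker.try_start("update_weights")  # not blocked by sample() traffic
```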

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Introduce per-tag async request accounting and tagging across
EnvRunnerGroup/ActorManager, add foreach_env_runner_async_fetch_ready
helper, and remove deprecated worker alias APIs/tests.
> 
> - **EnvRunnerGroup**:
> - Add tag support to async APIs: `num_in_flight_async_reqs(tag)`,
`foreach_env_runner_async(func, tag, kwargs=...)`, and
`fetch_ready_async_reqs(tags=...)`.
> - New convenience API: `foreach_env_runner_async_fetch_ready(...)` to
fetch ready results (by tag) and immediately enqueue new async calls.
> - Remove deprecated worker alias APIs (`probe_unhealthy_workers`,
`foreach_worker*`, `local_worker`, `_remote_workers`, `remote_workers`).
> - **ActorManager**:
> - Track in-flight async requests per tag via
`_ActorState.num_in_flight_async_requests_by_tag`; add helpers to
increment/decrement and query by tag.
> - Enforce per-tag limits in `foreach_actor_async`; maintain `(tag,
actor_id)` mapping for in-flight calls; support
`num_outstanding_async_reqs(tag)`.
> - `fetch_ready_async_reqs(tags=...)` and `_filter_calls_by_tag(tags)`
accept None/strings/lists and update per-tag counters on completion;
clear per-tag state on actor removal.
> - **Tests**:
> - Add `test_foreach_env_runner_async_fetch_ready` validating tagged
async fetch helper.
> - Update rollout worker tests to stop using removed
`foreach_env_runner_with_id`.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
172f9ac. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
Qiaolin-Yu and others added 29 commits October 9, 2025 18:51
…57603)

If a user tries to pass pure CPU data with `tensor_transport="nixl"`, the
nixl_reg_descs field in the metadata may be None.
ray-project#57267)

We’re upgrading Ray’s underlying metrics infrastructure from OpenCensus
to OpenTelemetry. As part of this migration, the : character is no
longer allowed in metric names (a restriction imposed by OpenTelemetry).
One of the major Ray applications affected by this change is vLLM, which
previously used : in its metric names. The good news is that vLLM has
already migrated from using : to _.

Starting with the next Ray release, a warning will be printed to help
other Ray applications prepare for this change. In the following
release, Ray will switch fully to the OpenTelemetry backend, and the :
character will be officially disallowed in metric names.

Test:
- CI

Tested locally with

- A ray program with all of its default metrics:

```
import ray

@ray.remote
def f():
    print("hi")

ray.get(f.remote())
```

```
> python ray_program.py
(f pid=59068) hi
```

- A ray program with a custom metric with the character ":"

```
import ray
from ray.util.metrics import Counter

counter = Counter("my:test", description="w00t")

@ray.remote
def f():
    counter.inc()
    print("hi")

ray.get([f.remote() for _ in range(5)])
```

```
> python ray_program.py
/Users/can/Projects/ray/python/ray/util/metrics.py:77: FutureWarning: Metric name my:test contains a : character, which is no longer allowed. Please migrate to the new metric name format. This will be an error in the future.
(f pid=74565) hi
(pid=74567) /Users/can/Projects/ray/python/ray/util/metrics.py:77: FutureWarning: Metric name my:test contains a : character, which is no longer allowed. Please migrate to the new metric name format. This will be an error in the future.
(f pid=74564) hi [repeated 4x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(pid=74563) /Users/can/Projects/ray/python/ray/util/metrics.py:77: FutureWarning: Metric name my:test contains a : character, which is no longer allowed. Please migrate to the new metric name format. This will be an error in the future. [repeated 4x across cluster]
```

Signed-off-by: Cuong Nguyen <[email protected]>
* Enables V2 env variable for most `ray/air/tests` and
`py_doctest[air]`.
* Explicitly disables V2 for certain tests that test legacy behavior
such as the old Tune integration, or marks them for migration in a
followup.
* Removes some unmaintained/deprecated code (`DummyTrainer`,
`custom_trainer`)
* Moved some tests to `ray/data` for utils that live in `ray.air` but
are only used in Ray Data. Note that I didn't move the util source code
to Data as that would have been a large change, but I moved the tests
out of this deprecated "air tests" suite:
    * `test_arrow` -> `test_arrow_type_conversion`
    * `test_data_batch_conversion`
    * `test_object_extension`
    * `test_tensor_extension`
    * `test_torch_tensor_utils`

---------

Signed-off-by: Justin Yu <[email protected]>
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?
RayNodeType is not a label for this metric because it's emitted by
node_manager.cc, which does not track this value.

<img width="1580" height="469" alt="Screenshot 2025-10-09 at 1 39 49 PM"
src="https://github.com/user-attachments/assets/293e5c9e-47bc-465e-b69f-ad5e3ee008eb"
/>

Quick fix to get this graph working again.


<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Alan Guo <[email protected]>
Add support for min, max, and time-weighted average as aggregation
functions over timeseries data.
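
For reference, a small standalone sketch of a time-weighted average over timeseries samples (illustrative only, not the aggregation code added here):

```
def time_weighted_average(samples):
    """samples: list of (timestamp_seconds, value) pairs sorted by timestamp."""
    total, duration = 0.0, 0.0
    for (t0, v0), (t1, _) in zip(samples, samples[1:]):
        # Each value is weighted by how long it was in effect.
        total += v0 * (t1 - t0)
        duration += t1 - t0
    return total / duration if duration else samples[-1][1]

# 1.0 for 10s, then 3.0 for 20s -> (1.0*10 + 3.0*20) / 30 = 2.33...
print(time_weighted_average([(0, 1.0), (10, 3.0), (30, 3.0)]))
```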

---------

Signed-off-by: abrar <[email protected]>
This change ensures that `reconfigure` is invoked with both
`user_config` and `rank` under the following conditions:

1. The user has implemented `reconfigure` and redeploys with an updated
`user_config`.
2. The user has implemented `reconfigure`, the replica rank changes, the
user has `rank` as a parameter in the `reconfigure` method signature, and
`deployment_config` contains a `user_config`.

`reconfigure` is also invoked at replica startup if the user has implemented
`reconfigure` and has provided some `user_config`.
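
A hedged sketch of the user-facing pattern described above (the exact parameter name and ordering are assumptions from this description):

```
from ray import serve

@serve.deployment(user_config={"threshold": 0.5})
class Model:
    def __init__(self):
        self.threshold = None
        self.rank = None

    def reconfigure(self, user_config: dict, rank: int):
        # Invoked at startup and again when user_config changes or, per this
        # change, when the replica rank changes.
        self.threshold = user_config["threshold"]
        self.rank = rank
```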


fixes ray-project#57048

---------

Signed-off-by: abrar <[email protected]>
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?
The current `test_map.py` is too big, which makes it harder to navigate,
and it can add 10+ minutes to CI when retried.

This PR splits it into
- `test_with_column.py`
- `test_map_batches.py`

This shrinks `test_map.py` from 3246 lines down to 1426.
<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Currently, hash shuffle relies on the default `max_retries` setting for
tasks, which is incorrect. Instead, we explicitly configure it to
retry indefinitely.
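
For reference, Ray tasks use `max_retries=-1` to retry indefinitely; a minimal illustration (not the hash-shuffle operator code itself):

```
import ray

@ray.remote(max_retries=-1)  # -1 retries the task indefinitely on failure
def shuffle_partition(block):
    return block
```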

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Alexey Kudinkin <[email protected]>
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Adds a converter from Ray Data expressions (`Expr`) to PyArrow compute expressions.
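
For context, a small example of the target representation, a PyArrow compute expression (illustrative usage only; the converter's own API lives in this PR):

```
import pyarrow.compute as pc

# The kind of expression the converter targets, suitable for predicate
# pushdown into PyArrow-based readers.
expr = (pc.field("age") > 21) & (pc.field("country") == "US")
print(expr)
```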

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Goutam V. <[email protected]>
Signed-off-by: Goutam <[email protected]>
… fetched (ray-project#57613)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

1. Fixing the prefetcher loop to avoid blocking on the next block being
fetched
2. Adding missing metrics for `BatchIterator`

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Alexey Kudinkin <[email protected]>
## Why are these changes needed?

By adding a delay, we can prevent middle layers in the routing path from
coalescing separate chunks together, since we assert each chunk arrives
independently.
## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: akyang-anyscale <[email protected]>
This supports the following functionality to enable Azure on the release
test pipeline:
- Upload working directory to Azure
- Upload metrics/results json file to Azure
- Download files (metrics/results json file) from Azure
- Helper function to parse an ABFSS URI into account, container, and path (see the sketch below)

This PR is broken down from
ray-project#57252 which also includes a
sample hello world test on Azure to test e2e. Proof that it works:
https://buildkite.com/ray-project/release/builds/62278
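
A hedged sketch of what the ABFSS parsing helper can look like (hypothetical function name; ABFSS URIs follow `abfss://<container>@<account>.dfs.core.windows.net/<path>`):

```
from urllib.parse import urlparse

def parse_abfss_uri(uri: str):
    # abfss://<container>@<account>.dfs.core.windows.net/<path>
    parsed = urlparse(uri)
    container, _, host = parsed.netloc.partition("@")
    account = host.split(".")[0]
    return account, container, parsed.path.lstrip("/")

assert parse_abfss_uri(
    "abfss://release@mystorage.dfs.core.windows.net/results/metrics.json"
) == ("mystorage", "release", "results/metrics.json")
```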

---------

Signed-off-by: kevin <[email protected]>
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Subject

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Alexey Kudinkin <[email protected]>
…ace"" (ray-project#57255)

Reverts ray-project#57248

Please review
ray-project@227b841
which is the fix for a previously accepted PR.

---------

Signed-off-by: Cuong Nguyen <[email protected]>
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

1. Updating `streaming_split` tests to increase coverage 
2. Updated release tests to test `equal=False` cases

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Alexey Kudinkin <[email protected]>
```
REGRESSION 16.07%: single_client_tasks_and_get_batch (THROUGHPUT) regresses from 5.261194854317881 to 4.415850247347108 in microbenchmark.json
REGRESSION 11.29%: placement_group_create/removal (THROUGHPUT) regresses from 751.064903521573 to 666.2773993932936 in microbenchmark.json
REGRESSION 11.14%: single_client_tasks_sync (THROUGHPUT) regresses from 900.96738867954 to 800.5633840543425 in microbenchmark.json
REGRESSION 10.14%: actors_per_second (THROUGHPUT) regresses from 566.4200586217125 to 508.9808896382363 in benchmarks/many_actors.json
REGRESSION 8.91%: 1_1_async_actor_calls_sync (THROUGHPUT) regresses from 1374.047824125402 to 1251.6025859481733 in microbenchmark.json
REGRESSION 8.70%: single_client_get_calls_Plasma_Store (THROUGHPUT) regresses from 9176.686326011131 to 8378.589542828342 in microbenchmark.json
REGRESSION 6.79%: 1_n_async_actor_calls_async (THROUGHPUT) regresses from 6964.257909926722 to 6491.439808045807 in microbenchmark.json
REGRESSION 6.68%: 1_1_actor_calls_sync (THROUGHPUT) regresses from 1826.440590474467 to 1704.5035425495187 in microbenchmark.json
REGRESSION 6.62%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 13.142098493341212 to 12.272053704608084 in microbenchmark.json
REGRESSION 6.61%: n_n_actor_calls_async (THROUGHPUT) regresses from 24808.730524179864 to 23168.372784365154 in microbenchmark.json
REGRESSION 6.19%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 4795.051007052156 to 4498.3519827438895 in microbenchmark.json
REGRESSION 6.05%: 1_1_actor_calls_async (THROUGHPUT) regresses from 7925.658042658907 to 7445.809146193413 in microbenchmark.json
REGRESSION 5.20%: n_n_async_actor_calls_async (THROUGHPUT) regresses from 21602.16598513169 to 20479.183697143773 in microbenchmark.json
REGRESSION 5.18%: single_client_tasks_async (THROUGHPUT) regresses from 7418.67591750316 to 7034.736389002367 in microbenchmark.json
REGRESSION 5.16%: single_client_put_gigabytes (THROUGHPUT) regresses from 20.350152593657818 to 19.30103208209274 in microbenchmark.json
REGRESSION 5.11%: tasks_per_second (THROUGHPUT) regresses from 388.36439061844453 to 368.5098005212305 in benchmarks/many_tasks.json
REGRESSION 4.27%: pgs_per_second (THROUGHPUT) regresses from 13.028153672527967 to 12.47149444972938 in benchmarks/many_pgs.json
REGRESSION 2.33%: single_client_wait_1k_refs (THROUGHPUT) regresses from 4.8129125825624035 to 4.700920788730696 in microbenchmark.json
REGRESSION 1.88%: client__put_gigabytes (THROUGHPUT) regresses from 0.10294244610916167 to 0.10100883378233687 in microbenchmark.json
REGRESSION 1.17%: 1_n_actor_calls_async (THROUGHPUT) regresses from 7563.474741840271 to 7474.798821945149 in microbenchmark.json
REGRESSION 46.59%: stage_3_creation_time (LATENCY) regresses from 1.8725192546844482 to 2.7449533939361572 in stress_tests/stress_test_many_tasks.json
REGRESSION 41.39%: dashboard_p99_latency_ms (LATENCY) regresses from 35.162 to 49.716 in benchmarks/many_nodes.json
REGRESSION 23.68%: dashboard_p99_latency_ms (LATENCY) regresses from 188.103 to 232.641 in benchmarks/many_pgs.json
REGRESSION 20.72%: dashboard_p99_latency_ms (LATENCY) regresses from 3446.344 to 4160.517 in benchmarks/many_actors.json
REGRESSION 15.85%: dashboard_p50_latency_ms (LATENCY) regresses from 4.26 to 4.935 in benchmarks/many_pgs.json
REGRESSION 15.44%: dashboard_p50_latency_ms (LATENCY) regresses from 5.544 to 6.4 in benchmarks/many_tasks.json
REGRESSION 12.31%: avg_iteration_time (LATENCY) regresses from 1.2971700072288512 to 1.4568077945709228 in stress_tests/stress_test_dead_actors.json
REGRESSION 11.15%: 1000000_queued_time (LATENCY) regresses from 179.146127773 to 199.115312395 in scalability/single_node.json
REGRESSION 10.11%: dashboard_p95_latency_ms (LATENCY) regresses from 2612.102 to 2876.107 in benchmarks/many_actors.json
REGRESSION 8.41%: dashboard_p50_latency_ms (LATENCY) regresses from 10.833 to 11.744 in benchmarks/many_actors.json
REGRESSION 8.25%: stage_1_avg_iteration_time (LATENCY) regresses from 12.93162693977356 to 13.99826169013977 in stress_tests/stress_test_many_tasks.json
REGRESSION 6.18%: stage_2_avg_iteration_time (LATENCY) regresses from 33.983641386032104 to 36.08304100036621 in stress_tests/stress_test_many_tasks.json
REGRESSION 6.07%: 3000_returns_time (LATENCY) regresses from 5.790547841000006 to 6.1422604579999955 in scalability/single_node.json
REGRESSION 4.99%: 10000_args_time (LATENCY) regresses from 19.077259766999987 to 20.028864411 in scalability/single_node.json
REGRESSION 4.97%: dashboard_p95_latency_ms (LATENCY) regresses from 10.799 to 11.336 in benchmarks/many_pgs.json
REGRESSION 4.83%: dashboard_p95_latency_ms (LATENCY) regresses from 13.338 to 13.982 in benchmarks/many_nodes.json
REGRESSION 4.73%: 10000_get_time (LATENCY) regresses from 24.000713915999995 to 25.136106761999997 in scalability/single_node.json
REGRESSION 4.40%: stage_3_time (LATENCY) regresses from 1821.4706330299377 to 1901.6145586967468 in stress_tests/stress_test_many_tasks.json
REGRESSION 2.90%: dashboard_p50_latency_ms (LATENCY) regresses from 6.935 to 7.136 in benchmarks/many_nodes.json
REGRESSION 2.74%: time_to_broadcast_1073741824_bytes_to_50_nodes (LATENCY) regresses from 13.41017694899999 to 13.777409734000003 in scalability/object_store.json
REGRESSION 1.89%: stage_0_time (LATENCY) regresses from 7.735846281051636 to 7.882433891296387 in stress_tests/stress_test_many_tasks.json
REGRESSION 1.62%: avg_pg_remove_time_ms (LATENCY) regresses from 1.396923734234321 to 1.419495533032749 in stress_tests/stress_test_placement_group.json
REGRESSION 1.34%: stage_4_spread (LATENCY) regresses from 0.5580154959703073 to 0.565494858742622 in stress_tests/stress_test_many_tasks.json
REGRESSION 0.09%: avg_pg_create_time_ms (LATENCY) regresses from 1.5636188018035782 to 1.5650917102100173 in stress_tests/stress_test_placement_group.json
```

Signed-off-by: kevin <[email protected]>
…ct#56952)

So that images that share the same post_build_script name but have
different depset files can have unique image tags.

---------

Signed-off-by: kevin <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
…ct#57614)

After some experimentation, the main culprit for the performance
degradation is actually the lag probe being too aggressive. The
default lag probe interval previously being 250ms caused as much as a 20%
degradation in performance when used in combination with enabling
io_context metrics. Setting the default to above 60s seems to mitigate
the issue. To come to this conclusion we tested with the below:

Trial 1: ~400 actors/s <-- way too slow
 -RAY_emit_main_serivce_metrics = 1

Trial 2: ~500+ actors/s <-- where we want to be
 -RAY_emit_main_serivce_metrics = -1

Trial 3: ~500+ actors/s
 -RAY_emit_main_serivce_metrics = 1
 -RAY_io_context_event_loop_lag_collection_interval_ms = -1 <-- disabled

Trial 4: ~500+ actors/s <-- bingo!
 -RAY_emit_main_serivce_metrics = 1
 -RAY_io_context_event_loop_lag_collection_interval_ms = 6000

The default value of 250ms combined with the increased usage of lag
probes when the metrics are enabled causes enough degradation to be
noticeable. Increasing the interval sufficiently seems to be the way to
go to avoid this while keeping our metrics.

---------

Signed-off-by: zac <[email protected]>
Co-authored-by: Ibrahim Rabbani <[email protected]>
…oject#56785)

# [Data][LLM] Add Video Processor and vllm example for ray data

## Summary
This PR introduces a production-grade video preprocessing stage for Ray
LLM batch pipelines. It parses video sources from OpenAI-style chat
messages, resolves sources with stream-first I/O and optional caching,
decodes via PyAV (FFmpeg), samples frames by fps or fixed count, and
outputs frames (PIL or NumPy) with per-video metadata. It aligns with
the image stage’s conventions while addressing video-specific needs.

## Motivation
Many multimodal and VLM workloads require a reliable, performant, and
testable video preprocessing step with:
- Explicit and deterministic frame sampling
- Stream-first networking with optional caching
- Robust error reporting and retries
- Consistent outputs for downstream model inference (PIL/NumPy)

## Key Features
- Source extraction: supports “video” and “video_url” in OpenAI chat
message content.
- Source resolution: HTTP(S), data URI, and local file; cache_mode:
memory | disk | auto.
- Sampling: fps-based or num_frames-based; num_frames deterministically
takes the first N decoded frames; optional max_sampled_frames cap;
bounded target generation.
- Outputs: PIL or NumPy; channels_first for NumPy; optional PIL
preprocessing (resize/crop/convert) with NumPy path routed through PIL
for consistency.
- Reliability: decode in a thread, async orchestration with concurrency
limits, retries with backoff, strict “no frames → failure,” enriched
error metadata.
- Safety caps: bounded target list and per-source decode frame count to
avoid pathological behavior on malformed streams.

## Design and Implementation Notes
- HTTP extraction lives in a shared `_util.py` (HTTPConnection) to:
- Centralize networking behavior (timeouts, retries, chunked download,
file download)
  - Improve testability (single patch point across stages)
  - Ensure consistent semantics between image and video stages
  - Avoid coupling the stage to a dataset/planner runtime
- Optional dependencies (PyAV, Pillow, NumPy) are imported lazily with
clear error messages.
- Decode is CPU-bound; it runs in a thread, while async orchestration
ensures concurrency limits and order-preserving emission.

We cannot directly reuse the download API
(ray-project#55824):

- That commit introduces a Ray Data download expression at the
planner/op/expression layer for high-throughput dataset ingestion and
better block-size estimation. It is ideal for offline ETL and bulk
workloads.
- This PR targets an online batch inference stage implemented as a UDF
with asyncio and a lightweight HTTP utility, optimized for low latency,
per-request ordering, and controlled concurrency within a batch stage.
- Directly embedding the planner path would:
- Introduce scheduling and planning overhead unsuited for low-latency
UDFs
- Complicate execution semantics (order preservation, per-request
grouping)
  - Increase dependency surface (Data planner) inside LLM batch stages
- Recommended composition: use Ray Data’s download expression offline to
materialize bytes/local files; then feed those paths/data into this
video stage for decoding/processing.

## Usage

- Package entrypoints:
- `PrepareVideoStage` / `PrepareVideoUDF` in
`ray.llm._internal.batch.stages.prepare_video_stage`

### Example 1: Use the UDF in a batch stage
- Input rows must contain `messages` in OpenAI chat format with video
entries (“video” or “video_url”).

```
from ray.llm._internal.batch.stages.prepare_video_stage import PrepareVideoUDF

udf = PrepareVideoUDF(
    data_column="__data",
    expected_input_keys=["messages"],
    sampling={"num_frames": 4},        # or {"fps": 3.0}
    output_format="numpy",              # or "pil"
    channels_first=True,                # NumPy-only
    cache_mode="auto",                  # "memory" | "disk" | "auto"
    cache_dir="/tmp/video-cache",       # optional for disk/auto
)

batch = {
    "__data": [
        {
            "messages": [
                {
                    "content": [
                        {"type": "video", "video": "https://host/video.mp4"},
                        {"type": "video_url", "video_url": {"url": "file:///data/v2.mp4"}},
                    ]
                }
            ]
        }
    ]
}

# Consume async UDF
async def run():
    outs = []
    async for out in udf(batch):
        outs.append(out["__data"][0])
    # out["video"] -> List[List[Frame]]
    # out["video_meta"] -> List[Dict] per video (size, timestamps, num_frames, failed, etc.)
    return outs
```

We can directly refer to this test:

``` text
pytest test_prepare_video_stage.py -v
============================================== test session starts ===============================================
platform linux -- Python 3.12.11, pytest-8.4.2, pluggy-1.6.0 -- /ray-workspace/ray/python/requirements/llm/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /ray-workspace/ray
configfile: pytest.ini
plugins: anyio-4.10.0, asyncio-1.2.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 19 items                                                                                               

test_prepare_video_stage.py::test_udf_extract_and_process_basic PASSED                                     [  5%]
test_prepare_video_stage.py::test_num_frames_sampling_exact PASSED                                         [ 10%]
test_prepare_video_stage.py::test_data_uri_handling PASSED                                                 [ 15%]
test_prepare_video_stage.py::test_local_file_path_handling PASSED                                          [ 21%]
test_prepare_video_stage.py::test_auto_cache_to_disk_when_num_frames PASSED                                [ 26%]
test_prepare_video_stage.py::test_av_missing_import_error_metadata PASSED                                  [ 31%]
test_prepare_video_stage.py::test_multiple_videos_order_preserved PASSED                                   [ 36%]
test_prepare_video_stage.py::test_preprocess_convert_numpy_consistency PASSED                              [ 42%]
test_prepare_video_stage.py::test_bytesio_format_guess_fallback PASSED                                     [ 47%]
test_prepare_video_stage.py::test_retries_success_and_counts PASSED                                        [ 52%]
test_prepare_video_stage.py::test_non_retriable_no_retry PASSED                                            [ 57%]
test_prepare_video_stage.py::test_target_cap_limits_frames PASSED                                          [ 63%]
test_prepare_video_stage.py::test_numpy_output_channels_first PASSED                                       [ 68%]
test_prepare_video_stage.py::test_strict_no_fallback_when_no_frames PASSED                                 [ 73%]
test_prepare_video_stage.py::test_e2e_with_pyav_synth PASSED                                               [ 78%]
test_prepare_video_stage.py::test_e2e_num_frames_pil PASSED                                                [ 84%]
test_prepare_video_stage.py::test_e2e_fps_sampling PASSED                                                  [ 89%]
test_prepare_video_stage.py::test_e2e_preprocess_resize_numpy_channels_first PASSED                        [ 94%]
test_prepare_video_stage.py::test_e2e_max_sampled_frames_cap PASSED                                        [100%]

=============================================== 19 passed in 1.77s ===============================================

```


### Example 2: Multimodal inference with vLLM
- Sample a few frames, preprocess to PIL/NumPy, then feed frames as
images to your multimodal prompt (one common pattern is to select top-k
frames and attach them as image inputs).

```
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from ray.llm._internal.batch.stages.prepare_video_stage import VideoProcessor
import asyncio
import tempfile
import os
from PIL import Image
from qwen_vl_utils import process_vision_info

async def process_video_with_vlm():
    # 1. Extract video frames
    vp = VideoProcessor(
        sampling={"num_frames": 4},
        output_format="pil",
        preprocess={"resize": {"size": [384, 384]}, "convert": "RGB"},
    )
    frames_and_meta = await vp.process(["./2-20.mp4"])
    frames = frames_and_meta[0]["frames"]
    print(f"Extracted {len(frames)} frames")

    # 2. Save frames to temporary files
    with tempfile.TemporaryDirectory() as tmp_dir:
        image_paths = []
        for i, frame in enumerate(frames):
            temp_path = os.path.join(tmp_dir, f"frame_{i}.jpg")
            frame.save(temp_path)
            image_paths.append(temp_path)

        # 3. Initialize model
        MODEL_PATH = "/vllm-workspace/tmp/vlm"
        llm = LLM(
            model=MODEL_PATH,
            limit_mm_per_prompt={"image": 10},
            trust_remote_code=True,
            enforce_eager=True
        )
        
        # 4. Construct messages
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    *[{"type": "image", "image": path} for path in image_paths],
                    {"type": "text", "text": "Summarize the content of this video"}
                ]
            }
        ]

        # 5. Process input
        processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)
        prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        image_inputs, _ = process_vision_info(messages)

        # 6. Generate results
        sampling_params = SamplingParams(
            temperature=0.1,
            top_p=0.001,
            max_tokens=512
        )
        outputs = llm.generate([{
            "prompt": prompt,
            "multi_modal_data": {"image": image_inputs}
        }], sampling_params=sampling_params)
        
        print("Generated result:", outputs[0].outputs[0].text)

asyncio.run(process_video_with_vlm())
```

Notes:
- If your vLLM interface expects byte-encoded images, convert PIL frames
to bytes (e.g., PNG/JPEG) before passing.
- If it expects NumPy tensors, use `output_format="numpy"` with
`channels_first` as needed.

## Dependencies
- Runtime (optional-by-use): `av` (PyAV), `pillow`, `numpy`.
- Tests: require the above; E2E tests synthesize MP4 with PyAV and
validate decode/processing.

## Backward Compatibility
- Additive functionality; does not break existing stages or APIs.

## Testing
- Unit tests cover:
  - fps/num_frames sampling, data URI, local path, auto cache to disk
- Missing dependency metadata, order preservation, NumPy output/channel
ordering
- BytesIO format guess fallback, retries and non-retriable paths,
sampling caps
- E2E tests (default enabled) synthesize MP4s with PyAV and validate
end-to-end behavior.

## Related Issues

Closes ray-project#56424 and ray-project#56767

 cc @GuyStone @richardliaw @nrghosh


<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Introduce a production-ready video preprocessing stage (with sampling,
caching, and metadata), centralize HTTP utilities, add env-based
tunables, and refactor image stage to use the shared HTTP client with
comprehensive tests.
> 
> - **LLM Batch Env Tunables**:
> - Add `python/ray/llm/_internal/batch/envs.py` with lazy env getters:
`RAY_LLM_BATCH_MAX_TARGETS`, `RAY_LLM_BATCH_MAX_DECODE_FRAMES`.
> - **Shared Utilities**:
> - New `python/ray/llm/_internal/batch/stages/_util.py` providing
`HTTPConnection` (sync/async GET, bytes/text/json, chunked download,
file download).
> - **Image Stage Refactor**:
> - `prepare_image_stage.py`: replace inline `HTTPConnection` with
`_util.HTTPConnection`; adjust tests to patch new path.
> - **Video Processing Stage (example)**:
> - Add `video_processor.py` implementing `VideoProcessor`,
`PrepareVideoUDF`, `PrepareVideoStage` with HTTP/data/local source
resolution, optional disk/memory cache, PyAV decode, fps/num_frames
sampling, PIL/NumPy outputs, preprocessing, retries/backoff, and
metadata.
>   - Add CLI `main.py` and README for usage.
> - **Tests**:
> - New `test_video_processor.py` covering sampling modes, data
URI/local/http sources, caching, retries, numpy/PIL outputs,
preprocessing, caps, and E2E with PyAV.
> - Update `test_prepare_image_stage.py` to patch
`_util.HTTPConnection`.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
b1ee418. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Signed-off-by: PAN <[email protected]>
Signed-off-by: Richard Liaw <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

See ray-project#57226. I got my env working; it was on Python 3.13 by accident.

<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

Solves ray-project#57226

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Henry Lindeman <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
…ray-project#57541)

`test_api.py::test_max_constructor_retry_count` was failing on Windows.

Tried expanding the timeout on wait_on_condition at the last part of the
test to 20s-40s and added a debug statement to check how far the
counter increments. It goes up by a varying amount, but I was able to
observe 9-12, never reaching 13.

Did some digging, and it seems that our Ray actor worker process is
created by forking on Linux, while Windows uses `CreateProcessA`, which
builds the process from scratch each time rather than forking. This
difference causes the Windows count to grow more slowly, IIUC. The call
for Windows with `CreateProcessA` is available
[here](https://github.com/ray-project/ray/blob/1296dc4699a3c1681fe3de6dd9f63af51d287582/src/ray/util/process.cc#L171),
and the forking path for Linux is available here.

Hence, the solution is to reduce the test's resource requirements by
launching 3 replicas instead of 4 and attempting fewer retries, to
satisfy both Linux and Windows.

---------

Signed-off-by: doyoung <[email protected]>
…#57535)

Part 1 of ray-project#56149.

1. Move `_serialized_policy_def` from `AutoscalingConfig` into
`AutoscalingPolicy`. We need this in order to reuse `AutoscalingPolicy`
for application-level autoscaling.
2. Make `autoscaling_policy` a top-level config in
`ServeApplicationSchema`.

---------

Signed-off-by: abrar <[email protected]>
ZacAttack added the `go` label (add ONLY when ready to merge, run all tests) on Oct 13, 2025.