Parallelization of xgboost_ray.train results in Error

For cross-validation, we are trying to parallelize xgboost_ray.train with ray.remote tasks, where each remote task trains on a different cross-validation split of the data. Running these tasks in parallel produces the errors shown in the traceback below; running the same tasks sequentially produces no errors. The reproducible example below is adapted from the example in the xgboost_ray documentation. It completes when run locally and fails only when the tasks are parallelized on a remote Ray cluster.

```python
from xgboost_ray import RayDMatrix, RayParams, train
from sklearn.datasets import load_breast_cancer
import ray
import numpy as np

ray.init("ray://ray-cluster-head-svc:10001")

@ray.remote
def f():
    train_x, train_y = load_breast_cancer(return_X_y=True)

    # Inflate the dataset: with small amounts of data, no ConnectionError is encountered.
    train_x = np.tile(train_x, (10000, 1))
    train_y = np.repeat(train_y, 10000)

    train_set = RayDMatrix(train_x, train_y)
    evals_result = {}
    bst = train(
        {
            "objective": "binary:logistic",
            "eval_metric": ["logloss", "error"],
        },
        train_set,
        evals_result=evals_result,
        evals=[(train_set, "train")],
        verbose_eval=False,
        ray_params=RayParams(
            num_actors=2,  # Number of remote actors
            cpus_per_actor=1))

    bst.save_model("model.xgb")
    print("Final training error: {:.4f}".format(
        evals_result["train"]["error"][-1]))

# The same data is used for both tasks for the sake of this example, but for
# cross-validation, the data passed to each task would be unique.
ray.get([f.remote() for i in range(2)])
```
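For context on the comment above about each task receiving unique data, here is a minimal sketch of how the per-task splits could be generated. It is not part of the failing script: `kfold_indices` is a hypothetical helper written only for illustration, it uses only the standard library, and the 569 rows assume the unmodified breast cancer dataset before tiling.

```python
import random


def kfold_indices(n_samples, n_splits, seed=0):
    """Return a list of (train_idx, test_idx) pairs, one pair per fold.

    Hypothetical helper for illustration; not part of xgboost_ray or Ray.
    """
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")

    # Shuffle row indices reproducibly so folds are not ordered by row.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, n_splits)
    splits = []
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        splits.append((train_idx, test_idx))
        start = stop
    return splits


# Assumption: 569 rows, the size of the untiled breast cancer dataset.
splits = kfold_indices(n_samples=569, n_splits=5)
for fold, (train_idx, test_idx) in enumerate(splits):
    print(f"fold {fold}: {len(train_idx)} train rows, {len(test_idx)} test rows")
```

In the real workload, each remote task would presumably accept one of these index pairs (for example, a variant of `f` called as `f.remote(train_idx, test_idx)`, which is an assumed signature) and build its RayDMatrix from `train_x[train_idx]` and `train_y[train_idx]`.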

Traceback:
```
(freenome-risk-prediction-py3.10) ➜  risk-prediction git:(main) ✗ python risk_prediction/mre.py
(autoscaler +9s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +9s) Adding 2 node(s) of type main.
(autoscaler +19s) Resized to 1 CPUs.
(autoscaler +19s) Adding 1 node(s) of type main.
(autoscaler +29s) Resized to 2 CPUs.
(autoscaler +45s) Adding 1 node(s) of type main.
(f pid=61, ip=10.76.2.5) [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training.
Log channel is reconnecting. Logs produced while the connection was down can be found on the head node of the cluster in `ray_client_server_[port].out`
2023-08-10 14:11:59,613 WARNING dataclient.py:403 -- Encountered connection issues in the data channel. Attempting to reconnect.
2023-08-10 14:12:16,715 ERROR dataclient.py:330 -- Unrecoverable error in data channel.
Traceback (most recent call last):
  File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/ray/util/client/worker.py", line 455, in _get
    for chunk in resp:
  File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/ray/util/client/worker.py", line 324, in _get_object_iterator
    for chunk in self.server.GetObject(req, *args, **kwargs):
  File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/grpc/_channel.py", line 475, in __next__
    return self._next()
  File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/grpc/_channel.py", line 881, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.FAILED_PRECONDITION
        details = "'NoneType' object is not iterable"
        debug_error_string = "UNKNOWN:Error received from peer ipv4:10.123.5.189:10001 {created_time:"2023-08-10T14:12:17.138947014+00:00", grpc_status:9, grpc_message:"\'NoneType\' object is not iterable"}"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zcarrico/risk-prediction/risk_prediction/mre.py", line 32, in <module>
    ray.get([f.remote() for i in range(2)])
  File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 102, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
  File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/ray/util/client/worker.py", line 434, in get
    res = self._get(to_get, op_timeout)
  File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/ray/util/client/worker.py", line 477, in _get
    raise decode_exception(e)
ConnectionError: GRPC connection failed: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.FAILED_PRECONDITION
        details = "'NoneType' object is not iterable"
        debug_error_string = "UNKNOWN:Error received from peer ipv4:10.123.5.189:10001 {created_time:"2023-08-10T14:12:17.138947014+00:00", grpc_status:9, grpc_message:"\'NoneType\' object is not iterable"}"
>
```

I expect the same error will be encountered when parallelizing remote tasks for nested cross-validation for HPO.

Please let me know if you have any questions, and thank you for the help and for the great xgboost_ray library!

Parallelization of xgboost_ray.train results in Error #291
