-
Notifications
You must be signed in to change notification settings - Fork 36
Description
For cross-validation, we are attempting to parallelize xgboost_ray.train using ray.remote tasks. Each remote task uses a different cross-validation split of the data. Unfortunately, parallelizing xgboost_ray.train results in the below errors. If the same tasks are run sequentially, rather than in parallel, no errors occur. Below is a reproducible example based on xgboost_ray's documentation's example. If this is run locally, it completes. It's only when parallelized on a remote Ray cluster that it results in the below errors.
from xgboost_ray import RayDMatrix, RayParams, train
from sklearn.datasets import load_breast_cancer
import ray
import numpy as np
ray.init("ray://ray-cluster-head-svc:10001")
@ray.remote
def f():
train_x, train_y = load_breast_cancer(return_X_y=True)
# the amount of data is increased because with small amounts of data no ConnectionError is encountered
train_x = np.tile(train_x, (10000, 1))
train_y = np.repeat(train_y, 10000)
train_set = RayDMatrix(train_x, train_y)
evals_result = {}
bst = train(
{
"objective": "binary:logistic",
"eval_metric": ["logloss", "error"],
},
train_set,
evals_result=evals_result,
evals=[(train_set, "train")],
verbose_eval=False,
ray_params=RayParams(
num_actors=2, # Number of remote actors
cpus_per_actor=1))
bst.save_model("model.xgb")
print("Final training error: {:.4f}".format(
evals_result["train"]["error"][-1]))
# The same data is used for both tasks for the sake of this example, but for cross-validation,
# the data passed to each task would be unique
ray.get([f.remote() for i in range(2)])
Traceback:
(freenome-risk-prediction-py3.10) ➜ risk-prediction git:(main) ✗ python risk_prediction/mre.py
(autoscaler +9s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +9s) Adding 2 node(s) of type main.
(autoscaler +19s) Resized to 1 CPUs.
(autoscaler +19s) Adding 1 node(s) of type main.
(autoscaler +29s) Resized to 2 CPUs.
(autoscaler +45s) Adding 1 node(s) of type main.
(f pid=61, ip=10.76.2.5) [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training.
Log channel is reconnecting. Logs produced while the connection was down can be found on the head node of the cluster in `ray_client_server_[port].out`
2023-08-10 14:11:59,613 WARNING dataclient.py:403 -- Encountered connection issues in the data channel. Attempting to reconnect.
2023-08-10 14:12:16,715 ERROR dataclient.py:330 -- Unrecoverable error in data channel.
Traceback (most recent call last):
File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/ray/util/client/worker.py", line 455, in _get
for chunk in resp:
File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/ray/util/client/worker.py", line 324, in _get_object_iterator
for chunk in self.server.GetObject(req, *args, **kwargs):
File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/grpc/_channel.py", line 475, in __next__
return self._next()
File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/grpc/_channel.py", line 881, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.FAILED_PRECONDITION
details = "'NoneType' object is not iterable"
debug_error_string = "UNKNOWN:Error received from peer ipv4:10.123.5.189:10001 {created_time:"2023-08-10T14:12:17.138947014+00:00", grpc_status:9, grpc_message:"\'NoneType\' object is not iterable"}"
>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/zcarrico/risk-prediction/risk_prediction/mre.py", line 32, in <module>
ray.get([f.remote() for i in range(2)])
File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 102, in wrapper
return getattr(ray, func.__name__)(*args, **kwargs)
File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/ray/util/client/api.py", line 42, in get
return self.worker.get(vals, timeout=timeout)
File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/ray/util/client/worker.py", line 434, in get
res = self._get(to_get, op_timeout)
File "/home/zcarrico/.virtualenvs/freenome-risk-prediction-gzgYbMUP-py3.10/lib/python3.10/site-packages/ray/util/client/worker.py", line 477, in _get
raise decode_exception(e)
ConnectionError: GRPC connection failed: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.FAILED_PRECONDITION
details = "'NoneType' object is not iterable"
debug_error_string = "UNKNOWN:Error received from peer ipv4:10.123.5.189:10001 {created_time:"2023-08-10T14:12:17.138947014+00:00", grpc_status:9, grpc_message:"\'NoneType\' object is not iterable"}"
>
I expect the same error will be encountered when parallelizing remote tasks for nested cross-validation for HPO.
Please let me know if you have any questions and thank you for the help and the great xgboost_ray library!