Open
Description
I'm running a RapidsMPF shuffle with Ray and occasionally (twice in 15 runs) see UCX errors during the shuffle. This was with rapidsmpf 25.06; I will revisit with 25.08. Log from one failing run:
2025-09-03 10:49:24.917 | INFO | nemo_curator.backends.experimental.ray_actor_pool.executor:_create_rapidsmpf_actors:219 - UCXX setup complete
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,967 E 1323505 1324986] logging.cc:118: Unhandled exception: N4ucxx13TimedOutErrorE. what(): Operation timed out
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,974 E 1323505 1324986] logging.cc:125: Stack trace:
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/ray/_raylet.so(+0x152da9a) [0x7ffff6eefa9a] ray::operator<<()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/ray/_raylet.so(+0x15309a2) [0x7ffff6ef29a2] ray::TerminateHandler()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb0da) [0x7ffff57ff0da]
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt10unexpectedv+0) [0x7ffff57e9a55] std::unexpected()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb391) [0x7ffff57ff391]
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/libucxx/lib64/libucxx.so(+0x42158) [0x7fcf0c153158]
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/libucxx/lib64/libucxx.so(_ZN4ucxx8Endpoint6createEP13ucp_ep_params+0x1ee) [0x7fcf0c16999e] ucxx::Endpoint::create()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/libucxx/lib64/libucxx.so(_ZN4ucxx31createEndpointFromWorkerAddressESt10shared_ptrINS_6WorkerEES0_INS_7AddressEEb+0x150) [0x7fcf0c170d70] ucxx::createEndpointFromWorkerAddress()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/libucxx/lib64/libucxx.so(_ZN4ucxx6Worker31createEndpointFromWorkerAddressESt10shared_ptrINS_7AddressEEb+0x109) [0x7fcf0c1916b9] ucxx::Worker::createEndpointFromWorkerAddress()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/librapidsmpf/lib64/librapidsmpf.so(+0x51420) [0x7fcf0c013420]
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/librapidsmpf/lib64/librapidsmpf.so(+0x3ddc1) [0x7fcf0bfffdc1]
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/librapidsmpf/lib64/librapidsmpf.so(_ZN9rapidsmpf4ucxx4UCXX4recvEiNS_3TagESt10unique_ptrINS_6BufferESt14default_deleteIS4_EE+0x4b) [0x7fcf0c01768b] rapidsmpf::ucxx::UCXX::recv()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/librapidsmpf/lib64/librapidsmpf.so(_ZN9rapidsmpf8shuffler8Shuffler8ProgressclEv+0xa7f) [0x7fcf0c043faf] rapidsmpf::shuffler::Shuffler::Progress::operator()()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/librapidsmpf/lib64/librapidsmpf.so(_ZN9rapidsmpf14ProgressThread13FunctionStateclEv+0x14) [0x7fcf0c030624] rapidsmpf::ProgressThread::FunctionState::operator()()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/librapidsmpf/lib64/librapidsmpf.so(_ZN9rapidsmpf14ProgressThread10event_loopEv+0x59) [0x7fcf0c030989] rapidsmpf::ProgressThread::event_loop()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/librapidsmpf/lib64/librapidsmpf.so(+0x6e10a) [0x7fcf0c03010a]
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7ffff5830db4]
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7ffff7d14aa4]
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7ffff7da1c3c]
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) *** SIGABRT received at time=1756921830 on cpu 88 ***
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) PC: @ 0x7ffff7d16b2c (unknown) pthread_kill
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) @ 0x7ffff7cbd330 3504 (unknown)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) @ 0x7ffff7cbd27e 32 raise
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) @ 0x7ffff7ca08ff 192 abort
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) @ 0x7ffff6ef297a 1184 ray::TerminateHandler()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) @ 0x7ffff57ff0da 16 (unknown)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) @ 0x7ffff57e9a55 16 std::terminate()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) @ 0x7ffff57ff391 48 __cxa_throw
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) @ 0x7fcf0c153158 48 (unknown)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) @ 0x7fb9ec267360 (unknown) (unknown)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,978 E 1323505 1324986] logging.cc:474: *** SIGABRT received at time=1756921830 on cpu 88 ***
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,978 E 1323505 1324986] logging.cc:474: PC: @ 0x7ffff7d16b2c (unknown) pthread_kill
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,979 E 1323505 1324986] logging.cc:474: @ 0x7ffff7cbd330 3504 (unknown)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,979 E 1323505 1324986] logging.cc:474: @ 0x7ffff7cbd27e 32 raise
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,979 E 1323505 1324986] logging.cc:474: @ 0x7ffff7ca08ff 192 abort
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,979 E 1323505 1324986] logging.cc:474: @ 0x7ffff6ef297a 1184 ray::TerminateHandler()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,980 E 1323505 1324986] logging.cc:474: @ 0x7ffff57ff0da 16 (unknown)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,980 E 1323505 1324986] logging.cc:474: @ 0x7ffff57e9a55 16 std::terminate()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,980 E 1323505 1324986] logging.cc:474: @ 0x7ffff57ff391 48 __cxa_throw
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,980 E 1323505 1324986] logging.cc:474: @ 0x7fcf0c153158 48 (unknown)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,982 E 1323505 1324986] logging.cc:474: @ 0x7fb9ec267360 (unknown) (unknown)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) Fatal Python error: Aborted
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, yaml._yaml, _brotli, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._json, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.typing.builtins.itertools, numba.cpython.builtins.math, 
numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, cupy_backends.cuda._softlink, cupy_backends.cuda.api._runtime_enum, cupy_backends.cuda.api.runtime, cupy._util, cupy.cuda.device, fastrlock.rlock, cupy.cuda.memory_hook, cupy_backends.cuda.stream, cupy.cuda.graph, cupy.cuda.stream, cupy_backends.cuda.api._driver_enum, cupy_backends.cuda.api.driver, cupy.cuda.memory, cupy._core.internal, cupy._core._carray, cupy.cuda.texture, cupy.cuda.function, cupy_backends.cuda.libs.nvrtc, cupy.cuda.pinned_memory, cupy.cuda.common, cupy.cuda.cub, cupy_backends.cuda.libs.nvtx, cupy.cuda.thrust, cupy._core._dtype, cupy._core._scalar, cupy._core._accelerator, cupy._core._memory_range, cupy._core._fusion_thread_local, cupy._core._kernel, cupy._core._routines_manipulation, cupy._core._routines_binary, cupy._core._optimize_config, cupy._core._cub_reduction, cupy._core._reduction, cupy._core._routines_math, cupy._core._routines_indexing, cupy._core._routines_linalg, cupy._core._routines_logic, cupy._core._routines_sorting, cupy._core._routines_statistics, cupy._core.dlpack, cupy._core.flags, cupy._core.core, cupy._core._fusion_variable, cupy._core._fusion_trace, cupy._core._fusion_kernel, cupy._core.new_fusion, cupy._core.fusion, cupy._core.raw, scipy.sparse._sparsetools, _csparsetools, _cyutility, scipy._cyutility, scipy.sparse._csparsetools, cupy.fft._cache, cupy.fft._callback, cupy.random._bit_generator, scipy._lib._uarray._uarray, scipy.special._ufuncs_cxx, scipy.special._ellip_harm_2, scipy.special._special_ufuncs, scipy.special._gufuncs, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_schur_sqrtm, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, 
cupy.lib._polynomial, numba.mviewbuf, pynvjitlink._nvjitlinklib, numba.core.typing.cmathdecl.cmath, numba.types.itertools, rmm.pylibrmm.cuda_stream, rmm.pylibrmm.stream, rmm.pylibrmm.helper, cuda.bindings._bindings.cydriver, cuda.bindings.cydriver, cuda.bindings.driver, cuda.bindings._lib.utils, cuda.bindings._bindings.cyruntime_ptds, cuda.bindings._bindings.cyruntime, cuda.bindings._lib.cyruntime.utils, cuda.bindings._lib.cyruntime.cyruntime, cuda.bindings.cyruntime, cuda.bindings.runtime, cuda.bindings.utils._get_handle, rmm.pylibrmm.memory_resource, rmm.pylibrmm.device_buffer, rmm.librmm._logger, rmm.pylibrmm.logger, nvtx._lib.lib, nvtx._lib.profiler, pylibcudf.libcudf.types, pylibcudf.types, pylibcudf.libcudf.aggregation, pylibcudf.aggregation, pylibcudf.gpumemoryview, pylibcudf.utils, pylibcudf._interop_helpers, pylibcudf.table, pylibcudf.filling, pylibcudf.traits, pylibcudf.column, pylibcudf.scalar, pylibcudf.libcudf.binaryop, pylibcudf.binaryop, pylibcudf.column_factories, pylibcudf.concatenate, pylibcudf.contiguous_split, pylibcudf.libcudf.copying, pylibcudf.copying, pylibcudf.libcudf.datetime, pylibcudf.datetime, pylibcudf.experimental, pylibcudf.libcudf.expressions, pylibcudf.groupby, pylibcudf.json, pylibcudf.nvtext.byte_pair_encode, pylibcudf.nvtext.deduplicate, pylibcudf.nvtext.edit_distance, pylibcudf.nvtext.generate_ngrams, pylibcudf.nvtext.jaccard, pylibcudf.nvtext.minhash, pylibcudf.nvtext.ngrams_tokenize, pylibcudf.nvtext.normalize, pylibcudf.nvtext.replace, pylibcudf.nvtext.stemmer, pylibcudf.nvtext.subword_tokenize, pylibcudf.nvtext.tokenize, pylibcudf.nvtext.wordpiece_tokenize, pylibcudf.rolling, pylibcudf.strings.attributes, pylibcudf.libcudf.strings.char_types, pylibcudf.strings.capitalize, pylibcudf.strings.case, pylibcudf.strings.char_types, pylibcudf.libcudf.strings.combine, pylibcudf.strings.combine, pylibcudf.libcudf.strings.regex_flags, pylibcudf.strings.regex_flags, pylibcudf.strings.regex_program, pylibcudf.strings.contains, 
pylibcudf.strings.convert.convert_booleans, pylibcudf.strings.convert.convert_datetime, pylibcudf.strings.convert.convert_durations, pylibcudf.strings.convert.convert_fixed_point, pylibcudf.strings.convert.convert_floats, pylibcudf.strings.convert.convert_integers, pylibcudf.strings.convert.convert_ipv4, pylibcudf.strings.convert.convert_lists, pylibcudf.strings.convert.convert_urls, pylibcudf.strings.extract, pylibcudf.strings.find, pylibcudf.strings.find_multiple, pylibcudf.strings.findall, pylibcudf.strings.padding, pylibcudf.strings.repeat, pylibcudf.strings.replace, pylibcudf.strings.replace_re, pylibcudf.libcudf.strings.side_type, pylibcudf.strings.side_type, pylibcudf.strings.slice, pylibcudf.strings.split.partition, pylibcudf.strings.split.split, pylibcudf.strings.strip, pylibcudf.libcudf.strings.translate, pylibcudf.strings.translate, pylibcudf.strings.wrap, pylibcudf.interop, pylibcudf.expressions, pylibcudf.hashing, pylibcudf.io.datasource, pylibcudf.libcudf.io.json, pylibcudf.libcudf.io.types, pylibcudf.io.types, pylibcudf.io.avro, pylibcudf.io.csv, pylibcudf.io.json, pylibcudf.io.orc, pylibcudf.io.parquet, pylibcudf.io.parquet_metadata, pylibcudf.io.text, pylibcudf.io.timezone, pylibcudf.jit, pylibcudf.join, pylibcudf.libcudf.labeling, pylibcudf.labeling, pylibcudf.libcudf.lists.combine, pylibcudf.libcudf.lists.contains, pylibcudf.lists, pylibcudf.merge, pylibcudf.null_mask, pylibcudf.partitioning, pylibcudf.quantiles, pylibcudf.libcudf.reduce, pylibcudf.reduce, pylibcudf.libcudf.replace, pylibcudf.replace, pylibcudf.reshape, pylibcudf.libcudf.round, pylibcudf.round, pylibcudf.search, pylibcudf.sorting, pylibcudf.libcudf.stream_compaction, pylibcudf.stream_compaction, pylibcudf.transform, pylibcudf.transpose, pylibcudf.libcudf.unary, pylibcudf.unary, pylibcudf.utilities, numba.cpython.mathimpl.math, numba.cpython.mathimpl.sys, cudf._lib.strings_udf, pyarrow._feather, pylibraft.common.cuda, pylibraft.common.handle, cugraph.structure.graph_primtypes, 
cugraph.structure.utils_wrapper, cytoolz.utils, cytoolz.itertoolz, cytoolz.functoolz, cytoolz.dicttoolz, cytoolz.recipes, xxhash._xxhash, markupsafe._speedups, scipy.fftpack.convolve, pyarrow._acero, pyarrow._csv, pyarrow._substrait, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._parquet_encryption, pyarrow._dataset_parquet_encryption, pyarrow._dataset_parquet, pyarrow._orc, tornado.speedups, pylibcugraph.components._connectivity, pylibcugraph.resource_handle, pylibcugraph.graph_properties, pylibcugraph.utils, pylibcugraph.graphs, pylibcugraph.internal_types.edge_id_lookup_result, pylibcugraph.edge_id_lookup_table, pylibcugraph.eigenvector_centrality, pylibcugraph.katz_centrality, pylibcugraph.pagerank, pylibcugraph.personalized_pagerank, pylibcugraph.has_vertex, pylibcugraph.sssp, pylibcugraph.hits, pylibcugraph.node2vec, pylibcugraph.random, pylibcugraph.node2vec_random_walks, pylibcugraph.bfs, pylibcugraph.internal_types.sampling_result, pylibcugraph.uniform_neighbor_sample, pylibcugraph.biased_neighbor_sample, pylibcugraph.homogeneous_uniform_neighbor_sample, pylibcugraph.homogeneous_biased_neighbor_sample, pylibcugraph.heterogeneous_uniform_neighbor_sample, pylibcugraph.heterogeneous_biased_neighbor_sample, pylibcugraph.internal_types.coo, pylibcugraph.negative_sampling, pylibcugraph.core_number, pylibcugraph.k_core, pylibcugraph.two_hop_neighbors, pylibcugraph.louvain, pylibcugraph.triangle_count, pylibcugraph.egonet, pylibcugraph.weakly_connected_components, pylibcugraph.uniform_random_walks, pylibcugraph.biased_random_walks, pylibcugraph.select_random_vertices, pylibcugraph.betweenness_centrality, pylibcugraph.induced_subgraph, pylibcugraph.ecg, pylibcugraph.balanced_cut_clustering, pylibcugraph.spectral_modularity_maximization, pylibcugraph.analyze_clustering_modularity, pylibcugraph.analyze_clustering_edge_cut, pylibcugraph.analyze_clustering_ratio_cut, pylibcugraph.leiden, pylibcugraph.edge_betweenness_centrality, pylibcugraph.generate_rmat_edgelist, 
pylibcugraph.generate_rmat_edgelists, pylibcugraph.replicate_edgelist, pylibcugraph.k_truss_subgraph, pylibcugraph.jaccard_coefficients, pylibcugraph.overlap_coefficients, pylibcugraph.sorensen_coefficients, pylibcugraph.cosine_coefficients, pylibcugraph.all_pairs_jaccard_coefficients, pylibcugraph.all_pairs_overlap_coefficients, pylibcugraph.all_pairs_sorensen_coefficients, pylibcugraph.all_pairs_cosine_coefficients, pylibcugraph.degrees, pylibcugraph.decompress_to_edgelist, pylibcugraph.renumber_arbitrary_edgelist, pylibcugraph.force_atlas2, pylibcugraph.minimum_spanning_tree, raft_dask.common.comms_utils, raft_dask.common.nccl, cugraph.dask.comms.comms_wrapper, cugraph.utilities.path_retrieval_wrapper, cugraph.structure.graph_primtypes_wrapper, cugraph.dask.structure.replication, cugraph.components.connectivity_wrapper, cugraph.tree.minimum_spanning_tree_wrapper, cugraph.linear_assignment.lap_wrapper, cugraph.layout.force_atlas2_wrapper, rapidsmpf.buffer.buffer, rapidsmpf._detail.exception_handling, rapidsmpf.buffer.spill_manager, rapidsmpf.buffer.resource, rapidsmpf.buffer.packed_data, rapidsmpf.communicator.communicator, rapidsmpf.statistics, rapidsmpf.progress_thread, rapidsmpf.shuffler, ucxx._lib.arr, ucxx._lib.libucxx, rapidsmpf._detail.config_options_get, rapidsmpf.config, rapidsmpf.communicator.ucxx (total: 416)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [1756921763.107471] [eos0022:1323505:0] parser.c:2326 UCX WARN unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [1756921763.107471] [eos0022:1323505:0] parser.c:2326 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [1756921824.962641] [eos0022:1323505:1] endpoint.cpp:143 UCXX WARN Timeout waiting for ucp_ep_create, retrying
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [1756921827.964816] [eos0022:1323505:1] endpoint.cpp:143 UCXX WARN Timeout waiting for ucp_ep_create, retrying
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [1756921830.966939] [eos0022:1323505:1] endpoint.cpp:141 UCXX ERROR Timeout waiting for ucp_ep_create, all attempts failed
2025-09-03 10:50:42.234 | ERROR | nemo_curator.backends.experimental.ray_actor_pool.executor:execute:125 - Error during pipeline execution: The actor died unexpectedly before finishing this task.
class_name: ShuffleStageAdapter
actor_id: baa7bf4dc0fa0b5fc6e5a8cc03000000
pid: 1323505
name: LSHStage-Worker_15
namespace: 04424988-f287-4bb6-bef3-db8aaee9f746
ip: 10.52.48.39
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2025-09-03 10:50:42.234 | INFO | nemo_curator.backends.experimental.ray_actor_pool.executor:execute:135 - Shutting down Ray to clean up all resources...
2025-09-03 10:50:42,619 INFO worker.py:1630 -- Using address ray://10.52.48.4:10001 set in the environment variable RAY_ADDRESS
2025-09-03 10:50:42,619 INFO client_builder.py:242 -- Passing the following kwargs to ray.init() on the server: log_to_driver
Traceback (most recent call last):
File "/home/adattagupta/rpv2-ray/e2e_4tb.py", line 19, in <module>
main()
File "/home/adattagupta/rpv2-ray/e2e_4tb.py", line 16, in main
workflow.run()
File "/opt/Curator/nemo_curator/stages/deduplication/fuzzy/workflow.py", line 299, in run
lsh_tasks = lsh_pipeline.run(executor=executor, initial_tasks=None)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/Curator/nemo_curator/pipeline/pipeline.py", line 197, in run
return executor.execute(self.stages, initial_tasks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/Curator/nemo_curator/backends/experimental/ray_actor_pool/executor.py", line 90, in execute
current_tasks = self._execute_lsh_stage(stage, current_tasks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/Curator/nemo_curator/backends/experimental/ray_actor_pool/executor.py", line 376, in _execute_lsh_stage
outputs = self._process_shuffle_stage_with_rapidsmpf_actors(actors, original_input, band_range)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/Curator/nemo_curator/backends/experimental/ray_actor_pool/executor.py", line 311, in _process_shuffle_stage_with_rapidsmpf_actors
_ = list(
^^^^^
File "/opt/venv/lib/python3.12/site-packages/ray/util/actor_pool.py", line 170, in get_generator
yield self.get_next_unordered()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/ray/util/actor_pool.py", line 370, in get_next_unordered
return ray.get(future)
^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return getattr(ray, func.__name__)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/ray/util/client/api.py", line 42, in get
return self.worker.get(vals, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/ray/util/client/worker.py", line 433, in get
res = self._get(to_get, op_timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/ray/util/client/worker.py", line 461, in _get
raise err
File "/opt/venv/lib/python3.12/site-packages/ray/util/client/server/server.py", line 491, in _get_object
items = ray.get(objectrefs, timeout=request.timeout)
^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/ray/_private/worker.py", line 2882, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/ray/_private/worker.py", line 970, in get_objects
raise value
^^^
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: ShuffleStageAdapter
actor_id: baa7bf4dc0fa0b5fc6e5a8cc03000000
pid: 1323505
name: LSHStage-Worker_15
namespace: 04424988-f287-4bb6-bef3-db8aaee9f746
ip: 10.52.48.39
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
srun: error: eos0003: task 0: Exited with exit code 1
srun: Terminating StepId=3601091.5
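For triage, the UCXX warning and error lines can be pulled out of the log to see how the endpoint-creation retries are spaced. The following is a minimal sketch, not part of RapidsMPF or UCXX: the helper names `ucxx_events` and `gaps` are hypothetical, and the regular expression assumes the `[<epoch seconds>] ... UCXX <LEVEL> <message>` layout seen in the three lines above, which are copied here with the Ray actor prefix removed.

```python
import re

# Three UCXX lines copied from the log above (Ray actor prefix removed).
LOG = """\
[1756921824.962641] [eos0022:1323505:1] endpoint.cpp:143 UCXX WARN Timeout waiting for ucp_ep_create, retrying
[1756921827.964816] [eos0022:1323505:1] endpoint.cpp:143 UCXX WARN Timeout waiting for ucp_ep_create, retrying
[1756921830.966939] [eos0022:1323505:1] endpoint.cpp:141 UCXX ERROR Timeout waiting for ucp_ep_create, all attempts failed
"""

# Assumed layout: "[<epoch seconds>] ... UCXX <LEVEL> <message>".
UCXX_LINE = re.compile(r"\[(\d+\.\d+)\].*?UCXX\s+(WARN|ERROR)\s+(.*)")


def ucxx_events(text):
    """Return (timestamp, level, message) for each UCXX WARN/ERROR line."""
    events = []
    for line in text.splitlines():
        match = UCXX_LINE.search(line)
        if match:
            events.append(
                (float(match.group(1)), match.group(2), match.group(3).strip())
            )
    return events


def gaps(events):
    """Seconds between consecutive events, rounded to milliseconds."""
    times = [timestamp for timestamp, _, _ in events]
    return [round(later - earlier, 3) for earlier, later in zip(times, times[1:])]


events = ucxx_events(LOG)
intervals = gaps(events)
for timestamp, level, message in events:
    print(f"{timestamp:.3f} {level:5s} {message}")
print("gaps between attempts (s):", intervals)
```

On these three lines the gaps come out at roughly 3 s each, and the final ERROR carries the same epoch second (1756921830) as the SIGABRT. This only describes the spacing observed in this run; it says nothing about how the timeout is configured inside UCXX.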