Occasional UCX error #486

@ayushdg

I'm running a RapidsMPF shuffle with Ray, and occasionally (twice in 15 runs) the shuffle fails with a UCX endpoint-creation timeout that aborts the actor. This was with rapidsmpf 25.06; I will revisit with 25.08. Log from one failing run:

2025-09-03 10:49:24.917 | INFO     | nemo_curator.backends.experimental.ray_actor_pool.executor:_create_rapidsmpf_actors:219 -     UCXX setup complete
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,967 E 1323505 1324986] logging.cc:118: Unhandled exception: N4ucxx13TimedOutErrorE. what(): Operation timed out
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,974 E 1323505 1324986] logging.cc:125: Stack trace: 
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39)  /opt/venv/lib/python3.12/site-packages/ray/_raylet.so(+0x152da9a) [0x7ffff6eefa9a] ray::operator<<()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/ray/_raylet.so(+0x15309a2) [0x7ffff6ef29a2] ray::TerminateHandler()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb0da) [0x7ffff57ff0da]
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt10unexpectedv+0) [0x7ffff57e9a55] std::unexpected()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb391) [0x7ffff57ff391]
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/libucxx/lib64/libucxx.so(+0x42158) [0x7fcf0c153158]
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/libucxx/lib64/libucxx.so(_ZN4ucxx8Endpoint6createEP13ucp_ep_params+0x1ee) [0x7fcf0c16999e] ucxx::Endpoint::create()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/libucxx/lib64/libucxx.so(_ZN4ucxx31createEndpointFromWorkerAddressESt10shared_ptrINS_6WorkerEES0_INS_7AddressEEb+0x150) [0x7fcf0c170d70] ucxx::createEndpointFromWorkerAddress()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/libucxx/lib64/libucxx.so(_ZN4ucxx6Worker31createEndpointFromWorkerAddressESt10shared_ptrINS_7AddressEEb+0x109) [0x7fcf0c1916b9] ucxx::Worker::createEndpointFromWorkerAddress()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/librapidsmpf/lib64/librapidsmpf.so(+0x51420) [0x7fcf0c013420]
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/librapidsmpf/lib64/librapidsmpf.so(+0x3ddc1) [0x7fcf0bfffdc1]
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/librapidsmpf/lib64/librapidsmpf.so(_ZN9rapidsmpf4ucxx4UCXX4recvEiNS_3TagESt10unique_ptrINS_6BufferESt14default_deleteIS4_EE+0x4b) [0x7fcf0c01768b] rapidsmpf::ucxx::UCXX::recv()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/librapidsmpf/lib64/librapidsmpf.so(_ZN9rapidsmpf8shuffler8Shuffler8ProgressclEv+0xa7f) [0x7fcf0c043faf] rapidsmpf::shuffler::Shuffler::Progress::operator()()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/librapidsmpf/lib64/librapidsmpf.so(_ZN9rapidsmpf14ProgressThread13FunctionStateclEv+0x14) [0x7fcf0c030624] rapidsmpf::ProgressThread::FunctionState::operator()()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/librapidsmpf/lib64/librapidsmpf.so(_ZN9rapidsmpf14ProgressThread10event_loopEv+0x59) [0x7fcf0c030989] rapidsmpf::ProgressThread::event_loop()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /opt/venv/lib/python3.12/site-packages/librapidsmpf/lib64/librapidsmpf.so(+0x6e10a) [0x7fcf0c03010a]
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7ffff5830db4]
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7ffff7d14aa4]
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7ffff7da1c3c]
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) 
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) *** SIGABRT received at time=1756921830 on cpu 88 ***
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) PC: @     0x7ffff7d16b2c  (unknown)  pthread_kill
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39)     @     0x7ffff7cbd330       3504  (unknown)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39)     @     0x7ffff7cbd27e         32  raise
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39)     @     0x7ffff7ca08ff        192  abort
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39)     @     0x7ffff6ef297a       1184  ray::TerminateHandler()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39)     @     0x7ffff57ff0da         16  (unknown)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39)     @     0x7ffff57e9a55         16  std::terminate()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39)     @     0x7ffff57ff391         48  __cxa_throw
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39)     @     0x7fcf0c153158         48  (unknown)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39)     @     0x7fb9ec267360  (unknown)  (unknown)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,978 E 1323505 1324986] logging.cc:474: *** SIGABRT received at time=1756921830 on cpu 88 ***
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,978 E 1323505 1324986] logging.cc:474: PC: @     0x7ffff7d16b2c  (unknown)  pthread_kill
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,979 E 1323505 1324986] logging.cc:474:     @     0x7ffff7cbd330       3504  (unknown)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,979 E 1323505 1324986] logging.cc:474:     @     0x7ffff7cbd27e         32  raise
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,979 E 1323505 1324986] logging.cc:474:     @     0x7ffff7ca08ff        192  abort
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,979 E 1323505 1324986] logging.cc:474:     @     0x7ffff6ef297a       1184  ray::TerminateHandler()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,980 E 1323505 1324986] logging.cc:474:     @     0x7ffff57ff0da         16  (unknown)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,980 E 1323505 1324986] logging.cc:474:     @     0x7ffff57e9a55         16  std::terminate()
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,980 E 1323505 1324986] logging.cc:474:     @     0x7ffff57ff391         48  __cxa_throw
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,980 E 1323505 1324986] logging.cc:474:     @     0x7fcf0c153158         48  (unknown)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [2025-09-03 10:50:30,982 E 1323505 1324986] logging.cc:474:     @     0x7fb9ec267360  (unknown)  (unknown)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) Fatal Python error: Aborted
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) 
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) 
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, yaml._yaml, _brotli, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._json, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.typing.builtins.itertools, numba.cpython.builtins.math, 
numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, cupy_backends.cuda._softlink, cupy_backends.cuda.api._runtime_enum, cupy_backends.cuda.api.runtime, cupy._util, cupy.cuda.device, fastrlock.rlock, cupy.cuda.memory_hook, cupy_backends.cuda.stream, cupy.cuda.graph, cupy.cuda.stream, cupy_backends.cuda.api._driver_enum, cupy_backends.cuda.api.driver, cupy.cuda.memory, cupy._core.internal, cupy._core._carray, cupy.cuda.texture, cupy.cuda.function, cupy_backends.cuda.libs.nvrtc, cupy.cuda.pinned_memory, cupy.cuda.common, cupy.cuda.cub, cupy_backends.cuda.libs.nvtx, cupy.cuda.thrust, cupy._core._dtype, cupy._core._scalar, cupy._core._accelerator, cupy._core._memory_range, cupy._core._fusion_thread_local, cupy._core._kernel, cupy._core._routines_manipulation, cupy._core._routines_binary, cupy._core._optimize_config, cupy._core._cub_reduction, cupy._core._reduction, cupy._core._routines_math, cupy._core._routines_indexing, cupy._core._routines_linalg, cupy._core._routines_logic, cupy._core._routines_sorting, cupy._core._routines_statistics, cupy._core.dlpack, cupy._core.flags, cupy._core.core, cupy._core._fusion_variable, cupy._core._fusion_trace, cupy._core._fusion_kernel, cupy._core.new_fusion, cupy._core.fusion, cupy._core.raw, scipy.sparse._sparsetools, _csparsetools, _cyutility, scipy._cyutility, scipy.sparse._csparsetools, cupy.fft._cache, cupy.fft._callback, cupy.random._bit_generator, scipy._lib._uarray._uarray, scipy.special._ufuncs_cxx, scipy.special._ellip_harm_2, scipy.special._special_ufuncs, scipy.special._gufuncs, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_schur_sqrtm, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, 
cupy.lib._polynomial, numba.mviewbuf, pynvjitlink._nvjitlinklib, numba.core.typing.cmathdecl.cmath, numba.types.itertools, rmm.pylibrmm.cuda_stream, rmm.pylibrmm.stream, rmm.pylibrmm.helper, cuda.bindings._bindings.cydriver, cuda.bindings.cydriver, cuda.bindings.driver, cuda.bindings._lib.utils, cuda.bindings._bindings.cyruntime_ptds, cuda.bindings._bindings.cyruntime, cuda.bindings._lib.cyruntime.utils, cuda.bindings._lib.cyruntime.cyruntime, cuda.bindings.cyruntime, cuda.bindings.runtime, cuda.bindings.utils._get_handle, rmm.pylibrmm.memory_resource, rmm.pylibrmm.device_buffer, rmm.librmm._logger, rmm.pylibrmm.logger, nvtx._lib.lib, nvtx._lib.profiler, pylibcudf.libcudf.types, pylibcudf.types, pylibcudf.libcudf.aggregation, pylibcudf.aggregation, pylibcudf.gpumemoryview, pylibcudf.utils, pylibcudf._interop_helpers, pylibcudf.table, pylibcudf.filling, pylibcudf.traits, pylibcudf.column, pylibcudf.scalar, pylibcudf.libcudf.binaryop, pylibcudf.binaryop, pylibcudf.column_factories, pylibcudf.concatenate, pylibcudf.contiguous_split, pylibcudf.libcudf.copying, pylibcudf.copying, pylibcudf.libcudf.datetime, pylibcudf.datetime, pylibcudf.experimental, pylibcudf.libcudf.expressions, pylibcudf.groupby, pylibcudf.json, pylibcudf.nvtext.byte_pair_encode, pylibcudf.nvtext.deduplicate, pylibcudf.nvtext.edit_distance, pylibcudf.nvtext.generate_ngrams, pylibcudf.nvtext.jaccard, pylibcudf.nvtext.minhash, pylibcudf.nvtext.ngrams_tokenize, pylibcudf.nvtext.normalize, pylibcudf.nvtext.replace, pylibcudf.nvtext.stemmer, pylibcudf.nvtext.subword_tokenize, pylibcudf.nvtext.tokenize, pylibcudf.nvtext.wordpiece_tokenize, pylibcudf.rolling, pylibcudf.strings.attributes, pylibcudf.libcudf.strings.char_types, pylibcudf.strings.capitalize, pylibcudf.strings.case, pylibcudf.strings.char_types, pylibcudf.libcudf.strings.combine, pylibcudf.strings.combine, pylibcudf.libcudf.strings.regex_flags, pylibcudf.strings.regex_flags, pylibcudf.strings.regex_program, pylibcudf.strings.contains, 
pylibcudf.strings.convert.convert_booleans, pylibcudf.strings.convert.convert_datetime, pylibcudf.strings.convert.convert_durations, pylibcudf.strings.convert.convert_fixed_point, pylibcudf.strings.convert.convert_floats, pylibcudf.strings.convert.convert_integers, pylibcudf.strings.convert.convert_ipv4, pylibcudf.strings.convert.convert_lists, pylibcudf.strings.convert.convert_urls, pylibcudf.strings.extract, pylibcudf.strings.find, pylibcudf.strings.find_multiple, pylibcudf.strings.findall, pylibcudf.strings.padding, pylibcudf.strings.repeat, pylibcudf.strings.replace, pylibcudf.strings.replace_re, pylibcudf.libcudf.strings.side_type, pylibcudf.strings.side_type, pylibcudf.strings.slice, pylibcudf.strings.split.partition, pylibcudf.strings.split.split, pylibcudf.strings.strip, pylibcudf.libcudf.strings.translate, pylibcudf.strings.translate, pylibcudf.strings.wrap, pylibcudf.interop, pylibcudf.expressions, pylibcudf.hashing, pylibcudf.io.datasource, pylibcudf.libcudf.io.json, pylibcudf.libcudf.io.types, pylibcudf.io.types, pylibcudf.io.avro, pylibcudf.io.csv, pylibcudf.io.json, pylibcudf.io.orc, pylibcudf.io.parquet, pylibcudf.io.parquet_metadata, pylibcudf.io.text, pylibcudf.io.timezone, pylibcudf.jit, pylibcudf.join, pylibcudf.libcudf.labeling, pylibcudf.labeling, pylibcudf.libcudf.lists.combine, pylibcudf.libcudf.lists.contains, pylibcudf.lists, pylibcudf.merge, pylibcudf.null_mask, pylibcudf.partitioning, pylibcudf.quantiles, pylibcudf.libcudf.reduce, pylibcudf.reduce, pylibcudf.libcudf.replace, pylibcudf.replace, pylibcudf.reshape, pylibcudf.libcudf.round, pylibcudf.round, pylibcudf.search, pylibcudf.sorting, pylibcudf.libcudf.stream_compaction, pylibcudf.stream_compaction, pylibcudf.transform, pylibcudf.transpose, pylibcudf.libcudf.unary, pylibcudf.unary, pylibcudf.utilities, numba.cpython.mathimpl.math, numba.cpython.mathimpl.sys, cudf._lib.strings_udf, pyarrow._feather, pylibraft.common.cuda, pylibraft.common.handle, cugraph.structure.graph_primtypes, 
cugraph.structure.utils_wrapper, cytoolz.utils, cytoolz.itertoolz, cytoolz.functoolz, cytoolz.dicttoolz, cytoolz.recipes, xxhash._xxhash, markupsafe._speedups, scipy.fftpack.convolve, pyarrow._acero, pyarrow._csv, pyarrow._substrait, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._parquet_encryption, pyarrow._dataset_parquet_encryption, pyarrow._dataset_parquet, pyarrow._orc, tornado.speedups, pylibcugraph.components._connectivity, pylibcugraph.resource_handle, pylibcugraph.graph_properties, pylibcugraph.utils, pylibcugraph.graphs, pylibcugraph.internal_types.edge_id_lookup_result, pylibcugraph.edge_id_lookup_table, pylibcugraph.eigenvector_centrality, pylibcugraph.katz_centrality, pylibcugraph.pagerank, pylibcugraph.personalized_pagerank, pylibcugraph.has_vertex, pylibcugraph.sssp, pylibcugraph.hits, pylibcugraph.node2vec, pylibcugraph.random, pylibcugraph.node2vec_random_walks, pylibcugraph.bfs, pylibcugraph.internal_types.sampling_result, pylibcugraph.uniform_neighbor_sample, pylibcugraph.biased_neighbor_sample, pylibcugraph.homogeneous_uniform_neighbor_sample, pylibcugraph.homogeneous_biased_neighbor_sample, pylibcugraph.heterogeneous_uniform_neighbor_sample, pylibcugraph.heterogeneous_biased_neighbor_sample, pylibcugraph.internal_types.coo, pylibcugraph.negative_sampling, pylibcugraph.core_number, pylibcugraph.k_core, pylibcugraph.two_hop_neighbors, pylibcugraph.louvain, pylibcugraph.triangle_count, pylibcugraph.egonet, pylibcugraph.weakly_connected_components, pylibcugraph.uniform_random_walks, pylibcugraph.biased_random_walks, pylibcugraph.select_random_vertices, pylibcugraph.betweenness_centrality, pylibcugraph.induced_subgraph, pylibcugraph.ecg, pylibcugraph.balanced_cut_clustering, pylibcugraph.spectral_modularity_maximization, pylibcugraph.analyze_clustering_modularity, pylibcugraph.analyze_clustering_edge_cut, pylibcugraph.analyze_clustering_ratio_cut, pylibcugraph.leiden, pylibcugraph.edge_betweenness_centrality, pylibcugraph.generate_rmat_edgelist, 
pylibcugraph.generate_rmat_edgelists, pylibcugraph.replicate_edgelist, pylibcugraph.k_truss_subgraph, pylibcugraph.jaccard_coefficients, pylibcugraph.overlap_coefficients, pylibcugraph.sorensen_coefficients, pylibcugraph.cosine_coefficients, pylibcugraph.all_pairs_jaccard_coefficients, pylibcugraph.all_pairs_overlap_coefficients, pylibcugraph.all_pairs_sorensen_coefficients, pylibcugraph.all_pairs_cosine_coefficients, pylibcugraph.degrees, pylibcugraph.decompress_to_edgelist, pylibcugraph.renumber_arbitrary_edgelist, pylibcugraph.force_atlas2, pylibcugraph.minimum_spanning_tree, raft_dask.common.comms_utils, raft_dask.common.nccl, cugraph.dask.comms.comms_wrapper, cugraph.utilities.path_retrieval_wrapper, cugraph.structure.graph_primtypes_wrapper, cugraph.dask.structure.replication, cugraph.components.connectivity_wrapper, cugraph.tree.minimum_spanning_tree_wrapper, cugraph.linear_assignment.lap_wrapper, cugraph.layout.force_atlas2_wrapper, rapidsmpf.buffer.buffer, rapidsmpf._detail.exception_handling, rapidsmpf.buffer.spill_manager, rapidsmpf.buffer.resource, rapidsmpf.buffer.packed_data, rapidsmpf.communicator.communicator, rapidsmpf.statistics, rapidsmpf.progress_thread, rapidsmpf.shuffler, ucxx._lib.arr, ucxx._lib.libucxx, rapidsmpf._detail.config_options_get, rapidsmpf.config, rapidsmpf.communicator.ucxx (total: 416)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [1756921763.107471] [eos0022:1323505:0]          parser.c:2326 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [1756921763.107471] [eos0022:1323505:0]          parser.c:2326 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [1756921824.962641] [eos0022:1323505:1]      endpoint.cpp:143  UCXX WARN  Timeout waiting for ucp_ep_create, retrying
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [1756921827.964816] [eos0022:1323505:1]      endpoint.cpp:143  UCXX WARN  Timeout waiting for ucp_ep_create, retrying
(ShuffleStageAdapter pid=1323505, ip=10.52.48.39) [1756921830.966939] [eos0022:1323505:1]      endpoint.cpp:141  UCXX ERROR Timeout waiting for ucp_ep_create, all attempts failed
2025-09-03 10:50:42.234 | ERROR    | nemo_curator.backends.experimental.ray_actor_pool.executor:execute:125 - Error during pipeline execution: The actor died unexpectedly before finishing this task.
	class_name: ShuffleStageAdapter
	actor_id: baa7bf4dc0fa0b5fc6e5a8cc03000000
	pid: 1323505
	name: LSHStage-Worker_15
	namespace: 04424988-f287-4bb6-bef3-db8aaee9f746
	ip: 10.52.48.39
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2025-09-03 10:50:42.234 | INFO     | nemo_curator.backends.experimental.ray_actor_pool.executor:execute:135 - Shutting down Ray to clean up all resources...
2025-09-03 10:50:42,619	INFO worker.py:1630 -- Using address ray://10.52.48.4:10001 set in the environment variable RAY_ADDRESS
2025-09-03 10:50:42,619	INFO client_builder.py:242 -- Passing the following kwargs to ray.init() on the server: log_to_driver
Traceback (most recent call last):
  File "/home/adattagupta/rpv2-ray/e2e_4tb.py", line 19, in <module>
    main()
  File "/home/adattagupta/rpv2-ray/e2e_4tb.py", line 16, in main
    workflow.run()
  File "/opt/Curator/nemo_curator/stages/deduplication/fuzzy/workflow.py", line 299, in run
    lsh_tasks = lsh_pipeline.run(executor=executor, initial_tasks=None)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/Curator/nemo_curator/pipeline/pipeline.py", line 197, in run
    return executor.execute(self.stages, initial_tasks)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/Curator/nemo_curator/backends/experimental/ray_actor_pool/executor.py", line 90, in execute
    current_tasks = self._execute_lsh_stage(stage, current_tasks)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/Curator/nemo_curator/backends/experimental/ray_actor_pool/executor.py", line 376, in _execute_lsh_stage
    outputs = self._process_shuffle_stage_with_rapidsmpf_actors(actors, original_input, band_range)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/Curator/nemo_curator/backends/experimental/ray_actor_pool/executor.py", line 311, in _process_shuffle_stage_with_rapidsmpf_actors
    _ = list(
        ^^^^^
  File "/opt/venv/lib/python3.12/site-packages/ray/util/actor_pool.py", line 170, in get_generator
    yield self.get_next_unordered()
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/ray/util/actor_pool.py", line 370, in get_next_unordered
    return ray.get(future)
           ^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/ray/util/client/worker.py", line 433, in get
    res = self._get(to_get, op_timeout)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/ray/util/client/worker.py", line 461, in _get
    raise err
  File "/opt/venv/lib/python3.12/site-packages/ray/util/client/server/server.py", line 491, in _get_object
    items = ray.get(objectrefs, timeout=request.timeout)
^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  ^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
  ^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/ray/_private/worker.py", line 2882, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  ^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/ray/_private/worker.py", line 970, in get_objects
    raise value
^^^
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
	class_name: ShuffleStageAdapter
	actor_id: baa7bf4dc0fa0b5fc6e5a8cc03000000
	pid: 1323505
	name: LSHStage-Worker_15
	namespace: 04424988-f287-4bb6-bef3-db8aaee9f746
	ip: 10.52.48.39
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
srun: error: eos0003: task 0: Exited with exit code 1
srun: Terminating StepId=3601091.5
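
The UCX/UCXX lines in the log carry epoch timestamps while the Ray lines carry wall-clock times, which makes the retry timing hard to read at a glance. Below is a minimal sketch for lining the two up. The helper names (`parse_epochs`, `gaps`, `to_local`) are made up for illustration, and the UTC-7 offset is an assumption inferred from comparing the two timestamp styles in this log, not something the log states.

```python
import re
from datetime import datetime, timedelta, timezone

# The three UCXX lines from the log above, with the Ray actor prefix dropped.
UCXX_LINES = [
    "[1756921824.962641] [eos0022:1323505:1]      endpoint.cpp:143  UCXX WARN  Timeout waiting for ucp_ep_create, retrying",
    "[1756921827.964816] [eos0022:1323505:1]      endpoint.cpp:143  UCXX WARN  Timeout waiting for ucp_ep_create, retrying",
    "[1756921830.966939] [eos0022:1323505:1]      endpoint.cpp:141  UCXX ERROR Timeout waiting for ucp_ep_create, all attempts failed",
]

# Matches the leading "[seconds.microseconds]" field of a UCX-style log line.
EPOCH_RE = re.compile(r"^\[(\d+\.\d+)\]")

# Assumption: the Ray timestamps are local time at UTC-7 (inferred, not stated in the log).
LOCAL_TZ = timezone(timedelta(hours=-7))


def parse_epochs(lines):
    """Return the leading epoch timestamp of each line as a float."""
    return [float(EPOCH_RE.match(line).group(1)) for line in lines]


def gaps(epochs):
    """Return the time in seconds between consecutive timestamps."""
    return [round(later - earlier, 3) for earlier, later in zip(epochs, epochs[1:])]


def to_local(epoch):
    """Convert an epoch timestamp to a timezone-aware local datetime."""
    return datetime.fromtimestamp(epoch, tz=LOCAL_TZ)


epochs = parse_epochs(UCXX_LINES)
for epoch in epochs:
    print(to_local(epoch).isoformat(timespec="milliseconds"))
print("seconds between attempts:", gaps(epochs))
```

On the three lines above, both gaps come out at roughly 3 seconds, and the final "all attempts failed" line lands on 10:50:30 local time, the same second as the `TimedOutError` reported by Ray.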
