Boweny/single controller #114
base: main
Conversation
# NOTE we have to keep all the MCT orchestrator started processes alive with this barrier
# until the ray cluster is stopped, otherwise the MCT orchestrator will reclaim the resources
# once the processes on a node exit
dist.barrier()
I don't think you want a barrier here, if I'm understanding correctly, because it will time out once the dist timeout is hit.
Indeed. I didn't find a good way to enable an unlimited timeout for the barrier (torch has one that accepts an arbitrarily long timeout, but still not simply unlimited), so I added a comment that we may still need a SyncActor.
In that case, do we want to use the longest reasonable timeout until we add support for a SyncActor (instead of using the default timeout)?
My plan is just to use SyncActor again in a following PR. It's a bit awkward to define the "longest reasonable timeout"; e.g., we could set it to a year or more, but that's a bit odd.
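
For reference, a minimal sketch of the pattern under discussion (standard torch.distributed APIs; not the PR's exact code). The timeout is fixed when the process group is created and the barrier inherits it, so the closest thing to "unlimited" is an arbitrarily large timedelta:

    from datetime import timedelta
    import torch.distributed as dist

    # Assumes RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT are already in the env.
    dist.init_process_group(
        backend='gloo',
        timeout=timedelta(days=365),  # arbitrarily long, but still finite
    )
    # Keeps orchestrator-started processes alive until teardown, at the cost
    # of timing out if teardown takes longer than the timeout above.
    dist.barrier()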
def init_model(self, model_name: str):
    """Initialize the model."""
    self.model = AutoModelForCausalLM.from_pretrained(
I don't think you want torch_dtype='auto'; you might end up with bf16 master weights. Master weights should always be float32. I know this is just a test, just noting.
Removed it, and it seems to still work.
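
A hedged sketch of the suggestion (the model name below is a placeholder): either pass torch.float32 explicitly or omit torch_dtype entirely, since transformers defaults to fp32 when it is not given, which is why removing it worked:

    import torch
    from transformers import AutoModelForCausalLM

    # Explicit fp32 master weights; torch_dtype='auto' could instead load
    # bf16/fp16 checkpoints in their stored dtype.
    model = AutoModelForCausalLM.from_pretrained(
        'gpt2',  # hypothetical model name
        torch_dtype=torch.float32,
    )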
    Returns:
        bool: True if Ray is setting CUDA_VISIBLE_DEVICES, False otherwise
    """
    return os.environ.get(
Is this something we want to support or should we disable this feature entirely?
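
The snippet above is truncated; one plausible completion matching the docstring (an assumption, not the PR's actual body):

    import os

    def is_cuda_visible_devices_set() -> bool:
        """Return True if Ray has set CUDA_VISIBLE_DEVICES for this worker."""
        return os.environ.get('CUDA_VISIBLE_DEVICES') is not None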
os.environ['RANK'] = str(rank)

# Set LOCAL_RANK based on Ray GPU allocation
os.environ['LOCAL_RANK'] = '0' if is_cuda_visible_devices_set(
It seems that setting LOCAL_RANK=0 on all nodes would break 2D-parallelism compatibility down the line, right?
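
For contrast, a hypothetical alternative (not the PR's code) that would keep local ranks distinct when more than one GPU per node is visible, as 2D parallelism would need; gpus_per_node is an assumed value:

    import os

    # Derive LOCAL_RANK from the global rank so multiple workers on one
    # node get distinct local ranks instead of all claiming 0.
    gpus_per_node = 8  # assumed value for illustration
    rank = int(os.environ['RANK'])
    os.environ['LOCAL_RANK'] = str(rank % gpus_per_node)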
self.master_port = master_port

# Set up basic environment variables
os.environ['WORLD_SIZE'] = str(world_size)
Would updating these env vars lead to issues with the base (gloo) process group defined here, or is that process group no longer relevant?
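
A sketch of the env:// rendezvous variables in play (the helper name is hypothetical). The point behind the question: a process group created before these are updated keeps its original membership and does not pick up the new values:

    import os

    def set_dist_env(world_size: int, rank: int,
                     master_addr: str, master_port: int) -> None:
        # Read by torch.distributed's env:// init at init_process_group() time.
        os.environ['WORLD_SIZE'] = str(world_size)
        os.environ['RANK'] = str(rank)
        os.environ['MASTER_ADDR'] = master_addr
        os.environ['MASTER_PORT'] = str(master_port)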
    'GPU': 1,
    'CPU': 1,
-   'worker_node': 1,
+   'worker_node': 0,
}] * tensor_parallel_size * num_engines
Curious what the impact of changing worker_node to 0 is in this case? I assume it makes sense, but wanted to understand why it was 1 originally.
Ah, good catch. The default should be 1; I'm allowing 0 basically for single-node testing, since otherwise it forces allocating an entire node for the inference engine.
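
Putting the thread together, a sketch of the bundle list under discussion (variable values are assumed): 'worker_node' is a custom Ray resource label, so requiring 1 unit per bundle pins inference bundles to dedicated worker nodes, while 0 lets them co-locate for single-node testing:

    tensor_parallel_size = 2  # assumed values for illustration
    num_engines = 1

    # One Ray placement-group bundle per tensor-parallel worker, per engine.
    bundles = [{
        'GPU': 1,
        'CPU': 1,
        'worker_node': 0,  # 0 for single-node testing; 1 to demand a worker node
    }] * tensor_parallel_size * num_engines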
def init_train_process_group(self):
    """Initialize the distributed process group."""
    # Initialize process group
    dist.init_process_group(timeout=timedelta(seconds=30))
Also curious about the implications of using dist.init_process_group here versus the vllm_utils.py-specific init_process_group used for creating the vLLM engine...
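
For context, a hedged illustration of the two initializations being contrasted; what vllm_utils.py's init_process_group actually does is an assumption here, not quoted from the PR:

    from datetime import timedelta
    import torch.distributed as dist

    # Trainer-side default group, as in the diff above.
    dist.init_process_group(timeout=timedelta(seconds=30))

    # Hypothetical separate group for trainer<->engine communication, kept
    # apart so engine collectives don't contend with trainer collectives.
    engine_group = dist.new_group(ranks=list(range(dist.get_world_size())))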