[WIP][EP][Failover] Migrate out alive requests when part of ep unit is down #281

KuilongCui · 2025-07-24T13:10:57Z

failover migration P0 ok
FAILOVER_MIGRATING not necessary ok
maybe concurrent migration
timeout for failover migration
random.choice -> round-robin (rr)
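
The last item proposes replacing random destination selection with round-robin. A minimal sketch of the difference follows; the helper names and instance IDs are hypothetical and not taken from the llumnix code base.

```python
import itertools
import random
from typing import Iterator, List


def pick_random(dst_instance_ids: List[str]) -> str:
    """Random choice: stateless, but may pick the same destination repeatedly."""
    return random.choice(dst_instance_ids)


def round_robin(dst_instance_ids: List[str]) -> Iterator[str]:
    """Round-robin: cycle through the destinations in a fixed order."""
    return itertools.cycle(dst_instance_ids)


# Hypothetical destination instances for illustration only.
dst_ids = ["instance_0", "instance_1", "instance_2"]
rr = round_robin(dst_ids)
assignments = [next(rr) for _ in range(5)]
print(assignments)
# ['instance_0', 'instance_1', 'instance_2', 'instance_0', 'instance_1']
```

Round-robin spreads migrated requests evenly across destinations, whereas random choice only balances them in expectation.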

llumnix/dp_manager.py

llumnix/global_scheduler/migration_scheduler.py

s5u13b · 2025-07-28T03:10:35Z

llumnix/global_scheduler/migration_scheduler.py

-                        migration_policy: MigrationPolicy) -> List[Tuple[str, str]]:
-        src_instance_infos, dst_instance_infos = self.migration_base_filter.filter_instances(instance_info.values())
+                        migration_policy: MigrationPolicy,
+                        skip_broken_unit: bool = True) -> List[Tuple[str, str]]:


Why add skip_broken_unit?

For migration with MigrationType == FAILOVER_MIGRATION, we cannot filter out BROKEN instances since they are the source of migration.

skip_broken_unit has been deleted.
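
The constraint discussed in this thread can be sketched as follows: for failover migration the BROKEN instances are the sources, so a filter that always drops them would leave nothing to migrate. The enum members other than BROKEN and FAILOVER_MIGRATION, the `InstanceInfo` stand-in, and `candidate_sources` are assumptions made for illustration; the real llumnix types may differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class UnitStatus(Enum):
    HEALTHY = "healthy"   # hypothetical member
    BROKEN = "broken"


class MigrationType(Enum):
    NORMAL_MIGRATION = "normal"      # hypothetical member
    FAILOVER_MIGRATION = "failover"


@dataclass
class InstanceInfo:
    """Hypothetical stand-in for the real instance info object."""
    instance_id: str
    unit_status: UnitStatus


def candidate_sources(instances: Iterable[InstanceInfo],
                      migration_type: MigrationType) -> List[InstanceInfo]:
    """Select source instances according to the migration type."""
    if migration_type is MigrationType.FAILOVER_MIGRATION:
        # Failover: broken instances are exactly the ones to drain.
        return [i for i in instances if i.unit_status is UnitStatus.BROKEN]
    # Other migration types: broken instances are filtered out.
    return [i for i in instances if i.unit_status is UnitStatus.HEALTHY]


instances = [
    InstanceInfo("instance_0", UnitStatus.HEALTHY),
    InstanceInfo("instance_1", UnitStatus.BROKEN),
]
failover_sources = candidate_sources(instances, MigrationType.FAILOVER_MIGRATION)
normal_sources = candidate_sources(instances, MigrationType.NORMAL_MIGRATION)
print([i.instance_id for i in failover_sources])  # ['instance_1']
print([i.instance_id for i in normal_sources])    # ['instance_0']
```

Deriving the behaviour from the migration type, rather than from a separate boolean flag, is one way to make a parameter like skip_broken_unit unnecessary.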

llumnix/llumlet/local_migration_scheduler.py

s5u13b · 2025-07-28T03:31:54Z

llumnix/llumlet/migration_coordinator.py

-            except Exception as e:
-                log_instance_exception(e, dst_instance_id, "migrate_out", migrate_out_request.request_id)
-            migrated_request_list.extend(migrated_request)
-            if len(migrated_request) == 0 and migrate_out_request.eom:


Where is eom?

while self.has_migration_slot() and (not migrate_out_request.eom):
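
A toy sketch of how a loop condition like the one above can take over the role of the removed `eom` check. It assumes that `eom` is a flag set on the request once its migration has ended; `ToyCoordinator`, `remaining_stages`, and `_migrate_one_stage` are hypothetical and do not reflect the actual migration_coordinator.py.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MigrateOutRequest:
    """Toy request; remaining_stages is a hypothetical field."""
    request_id: str
    remaining_stages: int
    eom: bool = False  # assumed to mean "migration of this request has ended"


class ToyCoordinator:
    def __init__(self, free_slots: int) -> None:
        self.free_slots = free_slots

    def has_migration_slot(self) -> bool:
        return self.free_slots > 0

    def _migrate_one_stage(self, request: MigrateOutRequest) -> List[str]:
        """Advance one stage; set eom and report the request on the last one."""
        request.remaining_stages -= 1
        if request.remaining_stages <= 0:
            request.eom = True
            return [request.request_id]
        return []

    def migrate_out(self, request: MigrateOutRequest) -> List[str]:
        migrated: List[str] = []
        # The loop exits as soon as eom is set, so no separate check is needed.
        while self.has_migration_slot() and (not request.eom):
            migrated.extend(self._migrate_one_stage(request))
        return migrated


request = MigrateOutRequest(request_id="req-0", remaining_stages=3)
result = ToyCoordinator(free_slots=1).migrate_out(request)
no_slot_result = ToyCoordinator(free_slots=0).migrate_out(
    MigrateOutRequest(request_id="req-1", remaining_stages=3))
print(result)          # ['req-0']
print(no_slot_result)  # []
```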

s5u13b · 2025-07-28T03:33:54Z

llumnix/manager.py

                    dst_instance_actor = self.instances[dst_instance_id]
                    asyncio.create_task(
                        asyncio_wait_for_ray_remote_call_with_timeout(
                            self.instances[src_instance_id].migrate_out,
                            dst_instance_actor, dst_instance_id, migration_type
                        )
                    )
+
+            if not exist_failover_migration_task:


Why not put this logic in push migrations? If len(failover_migration_tasks) > 0, return an empty normal_migration_tasks.
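
The suggestion in this comment can be sketched as a single selection step that lets failover tasks pre-empt normal ones. The function name `select_migration_tasks` and the task tuple layout are assumptions for illustration, not the project's API.

```python
from typing import List, Tuple

# Hypothetical task shape: (src_instance_id, dst_instance_id).
MigrationTask = Tuple[str, str]


def select_migration_tasks(
    failover_migration_tasks: List[MigrationTask],
    normal_migration_tasks: List[MigrationTask],
) -> Tuple[List[MigrationTask], List[MigrationTask]]:
    """Return (failover, normal) tasks; normal tasks are dropped while failover is pending."""
    if len(failover_migration_tasks) > 0:
        return failover_migration_tasks, []
    return [], normal_migration_tasks


failover, normal = select_migration_tasks(
    [("broken_0", "healthy_1")], [("healthy_2", "healthy_3")])
print(failover, normal)  # [('broken_0', 'healthy_1')] []

failover, normal = select_migration_tasks([], [("healthy_2", "healthy_3")])
print(failover, normal)  # [] [('healthy_2', 'healthy_3')]
```

Keeping the decision in one place means the caller does not need a separate flag such as exist_failover_migration_task.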

github-actions · 2025-07-28T05:55:54Z

test_request_trace[rayqueue-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms)	mean	p50	p99	min	max
across_manager_latency	-3034053628.3221	-3034053628.3221	-3034053628.3221	-3034053628.3221	-3034053628.3221
across_llumlet_latency	3034053629.6314	3034053629.6314	3034053629.6314	3034053629.6314	3034053629.6314
across_engine_latency	0.1503	0.1503	0.1503	0.1503	0.1503
process_model_outputs_latency	0.4304	0.3999	0.5778	0.3760	0.5819
engine_step_latency	34.0110	33.8785	34.9004	33.7532	34.9672
step_postprocess_latency	0.0000	0.0000	0.0000	0.0000	0.0000
across_async_put_queue_thread_latency	0.0158	0.0146	0.0226	0.0132	0.0229
across_async_put_queue_actor_latency	0.0380	0.0364	0.0533	0.0329	0.0546
across_queue_client_latency	0.0332	0.0337	0.0443	0.0271	0.0450
queue_rpc_latency	0.2946	0.2773	0.4441	0.2608	0.4576
api_server_get_queue_latency	0.2038	0.1944	0.2737	0.1865	0.2793
across_request_streams_latency	0.0742	0.0556	0.1875	0.0517	0.1971

test_request_trace[zmq-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms)	mean	p50	p99	min	max
across_manager_latency	-3034193812.7334	-3034193812.7334	-3034193812.7334	-3034193812.7334	-3034193812.7334
across_llumlet_latency	3034193814.1497	3034193814.1497	3034193814.1497	3034193814.1497	3034193814.1497
across_engine_latency	0.1772	0.1772	0.1772	0.1772	0.1772
process_model_outputs_latency	0.4446	0.4095	0.5755	0.3886	0.5757
engine_step_latency	33.8888	33.7867	34.7845	33.7341	34.8798
step_postprocess_latency	0.0000	0.0000	0.0000	0.0000	0.0000
across_async_put_queue_thread_latency	0.0167	0.0153	0.0246	0.0142	0.0251
across_async_put_queue_actor_latency	0.0390	0.0358	0.0620	0.0338	0.0641
across_queue_client_latency	0.0332	0.0301	0.0521	0.0271	0.0536
queue_rpc_latency	0.2149	0.1925	0.3769	0.1747	0.3920
api_server_get_queue_latency	0.1122	0.1082	0.1623	0.0966	0.1668
across_request_streams_latency	0.0418	0.0316	0.1111	0.0271	0.1162

github-actions · 2025-07-28T06:14:19Z

test_request_trace[rayqueue-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms)	mean	p50	p99	min	max
across_manager_latency	-3035159780.3036	-3035159780.3036	-3035159780.3036	-3035159780.3036	-3035159780.3036
across_llumlet_latency	3035159781.5923	3035159781.5923	3035159781.5923	3035159781.5923	3035159781.5923
across_engine_latency	0.1591	0.1591	0.1591	0.1591	0.1591
process_model_outputs_latency	0.4277	0.4055	0.5604	0.3815	0.5623
engine_step_latency	33.8438	33.7276	34.8042	33.6335	34.9053
step_postprocess_latency	0.0000	0.0000	0.0000	0.0000	0.0000
across_async_put_queue_thread_latency	0.0156	0.0146	0.0214	0.0137	0.0218
across_async_put_queue_actor_latency	0.0386	0.0370	0.0563	0.0326	0.0579
across_queue_client_latency	0.0357	0.0351	0.0523	0.0278	0.0539
queue_rpc_latency	0.2212	0.2074	0.3650	0.1799	0.3789
api_server_get_queue_latency	0.1128	0.1072	0.1628	0.0947	0.1669
across_request_streams_latency	0.0420	0.0316	0.1105	0.0275	0.1156

test_request_trace[zmq-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms)	mean	p50	p99	min	max
across_manager_latency	-3035296725.9502	-3035296725.9502	-3035296725.9502	-3035296725.9502	-3035296725.9502
across_llumlet_latency	3035296727.1860	3035296727.1860	3035296727.1860	3035296727.1860	3035296727.1860
across_engine_latency	0.1527	0.1527	0.1527	0.1527	0.1527
process_model_outputs_latency	0.4426	0.4181	0.5654	0.3890	0.5686
engine_step_latency	33.8631	33.7474	34.7569	33.7326	34.8497
step_postprocess_latency	0.0000	0.0000	0.0000	0.0000	0.0000
across_async_put_queue_thread_latency	0.0159	0.0148	0.0223	0.0140	0.0228
across_async_put_queue_actor_latency	0.0397	0.0375	0.0537	0.0347	0.0543
across_queue_client_latency	0.0362	0.0335	0.0507	0.0305	0.0511
queue_rpc_latency	0.2120	0.1942	0.3442	0.1768	0.3553
api_server_get_queue_latency	0.1116	0.1061	0.1604	0.0924	0.1638
across_request_streams_latency	0.0424	0.0314	0.1104	0.0281	0.1156

github-actions · 2025-07-28T12:00:27Z

test_request_trace[rayqueue-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms)	mean	p50	p99	min	max
across_manager_latency	-3055926918.4997	-3055926918.4997	-3055926918.4997	-3055926918.4997	-3055926918.4997
across_llumlet_latency	3055926919.7197	3055926919.7197	3055926919.7197	3055926919.7197	3055926919.7197
across_engine_latency	0.1514	0.1514	0.1514	0.1514	0.1514
process_model_outputs_latency	0.4449	0.4158	0.5958	0.3902	0.5994
engine_step_latency	34.0164	33.9100	34.9387	33.7473	35.0316
step_postprocess_latency	0.0000	0.0000	0.0000	0.0000	0.0000
across_async_put_queue_thread_latency	0.0161	0.0151	0.0233	0.0142	0.0240
across_async_put_queue_actor_latency	0.0482	0.0447	0.0583	0.0423	0.0584
across_queue_client_latency	0.0555	0.0542	0.0631	0.0537	0.0637
queue_rpc_latency	0.2715	0.2545	0.4056	0.2384	0.4181
api_server_get_queue_latency	0.1136	0.1046	0.1651	0.0975	0.1688
across_request_streams_latency	0.0424	0.0290	0.1236	0.0270	0.1302

test_request_trace[zmq-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms)	mean	p50	p99	min	max
across_manager_latency	-3056063872.9079	-3056063872.9079	-3056063872.9079	-3056063872.9079	-3056063872.9079
across_llumlet_latency	3056063874.1515	3056063874.1515	3056063874.1515	3056063874.1515	3056063874.1515
across_engine_latency	0.1526	0.1526	0.1526	0.1526	0.1526
process_model_outputs_latency	0.4304	0.4045	0.5678	0.3795	0.5712
engine_step_latency	34.0194	33.9190	34.9072	33.7869	34.9954
step_postprocess_latency	0.0000	0.0000	0.0000	0.0000	0.0000
across_async_put_queue_thread_latency	0.0151	0.0144	0.0209	0.0134	0.0213
across_async_put_queue_actor_latency	0.0456	0.0432	0.0585	0.0412	0.0588
across_queue_client_latency	0.0564	0.0544	0.0695	0.0529	0.0703
queue_rpc_latency	0.2741	0.2513	0.4349	0.2348	0.4499
api_server_get_queue_latency	0.1162	0.1083	0.1863	0.0958	0.1923
across_request_streams_latency	0.0437	0.0318	0.1229	0.0269	0.1292

github-actions · 2025-07-29T04:44:45Z

test_request_trace[rayqueue-engine_vLLM-/mnt/model/Qwen2.5-7B]

latency(ms)	mean	p50	p99	min	max
across_manager_latency	1.8060	1.8060	1.8060	1.8060	1.8060
across_llumlet_latency	0.9921	0.9921	0.9921	0.9921	0.9921
across_engine_latency	0.3012	0.3012	0.3012	0.3012	0.3012
process_model_outputs_latency	0.0844	0.0808	0.1077	0.0763	0.1087
engine_step_latency	34.1148	34.1174	34.4124	33.8346	34.4337
step_postprocess_latency	0.0229	0.0120	0.1092	0.0113	0.1186
across_async_put_queue_thread_latency	0.0115	0.0116	0.0123	0.0107	0.0123
across_async_put_queue_actor_latency	0.1782	0.1912	0.2043	0.0494	0.2048
across_queue_client_latency	0.0348	0.0341	0.0423	0.0320	0.0429
queue_rpc_latency	0.2910	0.2879	0.3182	0.2728	0.3191
api_server_get_queue_latency	0.1076	0.1053	0.1208	0.1031	0.1215
across_request_streams_latency	0.0794	0.0647	0.1713	0.0628	0.1798

test_request_trace[zmq-engine_vLLM-/mnt/model/Qwen2.5-7B]

latency(ms)	mean	p50	p99	min	max
across_manager_latency	1.7507	1.7507	1.7507	1.7507	1.7507
across_llumlet_latency	1.6250	1.6250	1.6250	1.6250	1.6250
across_engine_latency	0.3102	0.3102	0.3102	0.3102	0.3102
process_model_outputs_latency	0.0850	0.0778	0.1246	0.0734	0.1272
engine_step_latency	34.2608	34.2193	35.0039	33.9132	35.0704
step_postprocess_latency	0.0267	0.0126	0.1443	0.0107	0.1572
across_async_put_queue_thread_latency	0.0120	0.0117	0.0144	0.0110	0.0146
across_async_put_queue_actor_latency	0.1968	0.1962	0.2223	0.1769	0.2231
across_queue_client_latency	0.0343	0.0318	0.0441	0.0282	0.0445
queue_rpc_latency	0.2786	0.2748	0.3189	0.2574	0.3205
api_server_get_queue_latency	0.1025	0.1021	0.1093	0.0970	0.1093
across_request_streams_latency	0.0759	0.0649	0.1633	0.0581	0.1708

github-actions · 2025-07-29T04:58:11Z

test_simple_benchmark[engine_BladeLLM-False-zmq-False-False-/mnt/model/Qwen2.5-7B]

latency(ms)	p25	p50	p75	p95	p99	mean
decode	56.51	64.60	73.29	90.35	122.94	66.47
prefill	166.55	245.46	498.04	1984.95	3784.55	521.08

test_simple_benchmark[engine_BladeLLM-True-zmq-False-False-/mnt/model/Qwen2.5-7B]

latency(ms)	p25	p50	p75	p95	p99	mean
decode	35.95	36.27	37.24	65.45	120.72	40.27
prefill	49.41	79.79	114.47	680.72	1890.38	185.53

test_simple_benchmark[engine_BladeLLM-False-rayqueue-False-False-/mnt/model/Qwen2.5-7B]

latency(ms)	p25	p50	p75	p95	p99	mean
decode	55.63	63.71	71.76	90.07	132.40	65.31
prefill	188.87	277.12	708.43	1834.10	4315.46	589.60

test_simple_benchmark[engine_BladeLLM-False-zmq-True-False-/mnt/model/Qwen2.5-7B]

latency(ms)	p25	p50	p75	p95	p99	mean
decode	35.61	36.01	36.85	37.73	37.98	36.20
prefill	134.92	151.98	199.46	828.04	1574.44	252.86

test_simple_benchmark[engine_BladeLLM-False-zmq-True-True-/mnt/model/Qwen2.5-7B]

latency(ms)	p25	p50	p75	p95	p99	mean
decode	35.13	35.53	35.84	38.00	38.34	35.67
prefill	144.34	184.69	345.82	1242.56	3003.89	408.95

github-actions · 2025-07-29T05:27:36Z

test_simple_benchmark[engine_vLLM-False-zmq-False-False-/mnt/model/Qwen2.5-7B]

latency(ms)	p25	p50	p75	p95	p99	mean
decode	60.91	68.67	75.49	103.61	201.16	72.70
prefill	195.95	1433.19	16908.96	36904.28	49573.28	9761.00

test_simple_benchmark[engine_vLLM-False-rayqueue-False-False-/mnt/model/Qwen2.5-7B]

latency(ms)	p25	p50	p75	p95	p99	mean
decode	57.12	68.75	74.47	95.62	146.90	67.00
prefill	334.81	5977.67	20329.74	38403.65	42062.05	11286.93

github-actions · 2025-07-29T05:27:55Z

test_request_trace[rayqueue-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms)	mean	p50	p99	min	max
across_manager_latency	1.0824	1.0824	1.0824	1.0824	1.0824
across_llumlet_latency	0.8978	0.8978	0.8978	0.8978	0.8978
across_engine_latency	0.1143	0.1143	0.1143	0.1143	0.1143
process_model_outputs_latency	0.4208	0.4066	0.5435	0.3894	0.5548
engine_step_latency	34.0240	33.8174	35.4233	33.7522	35.5450
step_postprocess_latency	0.0000	0.0000	0.0000	0.0000	0.0000
across_async_put_queue_thread_latency	0.0147	0.0144	0.0171	0.0137	0.0172
across_async_put_queue_actor_latency	0.0345	0.0336	0.0417	0.0312	0.0422
across_queue_client_latency	0.0294	0.0289	0.0356	0.0247	0.0360
queue_rpc_latency	0.2013	0.1953	0.2453	0.1825	0.2479
api_server_get_queue_latency	0.1092	0.1079	0.1291	0.0989	0.1303
across_request_streams_latency	0.0415	0.0318	0.1192	0.0275	0.1270

test_request_trace[zmq-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms)	mean	p50	p99	min	max
across_manager_latency	1.0955	1.0955	1.0955	1.0955	1.0955
across_llumlet_latency	0.8261	0.8261	0.8261	0.8261	0.8261
across_engine_latency	0.1053	0.1053	0.1053	0.1053	0.1053
process_model_outputs_latency	0.4207	0.4098	0.5430	0.3915	0.5551
engine_step_latency	34.0783	33.8866	35.3813	33.7608	35.4922
step_postprocess_latency	0.0000	0.0000	0.0000	0.0000	0.0000
across_async_put_queue_thread_latency	0.0146	0.0147	0.0160	0.0136	0.0161
across_async_put_queue_actor_latency	0.0371	0.0368	0.0465	0.0325	0.0471
across_queue_client_latency	0.0311	0.0315	0.0354	0.0273	0.0355
queue_rpc_latency	0.1945	0.1939	0.2215	0.1741	0.2218
api_server_get_queue_latency	0.1069	0.1034	0.1300	0.0966	0.1315
across_request_streams_latency	0.0403	0.0305	0.1118	0.0281	0.1190

llumnix/dp_manager.py

github-actions · 2025-07-29T07:57:46Z

test_request_trace[rayqueue-engine_vLLM-/mnt/model/Qwen2.5-7B]

latency(ms)	mean	p50	p99	min	max
across_manager_latency	1.5590	1.5590	1.5590	1.5590	1.5590
across_llumlet_latency	1.0896	1.0896	1.0896	1.0896	1.0896
across_engine_latency	0.3316	0.3316	0.3316	0.3316	0.3316
process_model_outputs_latency	0.0886	0.0838	0.1096	0.0786	0.1099
engine_step_latency	34.1213	34.1293	34.2591	33.8835	34.2592
step_postprocess_latency	0.0205	0.0117	0.0907	0.0113	0.0983
across_async_put_queue_thread_latency	0.0112	0.0113	0.0116	0.0105	0.0116
across_async_put_queue_actor_latency	0.1819	0.1969	0.2123	0.0419	0.2133
across_queue_client_latency	0.0375	0.0317	0.0842	0.0284	0.0889
queue_rpc_latency	0.2699	0.2646	0.3250	0.2452	0.3286
api_server_get_queue_latency	0.1044	0.1027	0.1181	0.0958	0.1189
across_request_streams_latency	0.0769	0.0644	0.1611	0.0575	0.1680

test_request_trace[zmq-engine_vLLM-/mnt/model/Qwen2.5-7B]

latency(ms)	mean	p50	p99	min	max
across_manager_latency	1.3815	1.3815	1.3815	1.3815	1.3815
across_llumlet_latency	1.0345	1.0345	1.0345	1.0345	1.0345
across_engine_latency	0.2964	0.2964	0.2964	0.2964	0.2964
process_model_outputs_latency	0.1029	0.0993	0.1235	0.0847	0.1237
engine_step_latency	34.1323	34.1459	34.3852	33.8193	34.3901
step_postprocess_latency	0.0250	0.0131	0.1150	0.0119	0.1244
across_async_put_queue_thread_latency	0.0128	0.0117	0.0217	0.0109	0.0226
across_async_put_queue_actor_latency	0.1504	0.1917	0.2177	0.0383	0.2193
across_queue_client_latency	0.0298	0.0296	0.0338	0.0255	0.0338
queue_rpc_latency	0.4182	0.4551	0.5910	0.2544	0.5958
api_server_get_queue_latency	0.2156	0.1875	0.3554	0.0965	0.3566
across_request_streams_latency	0.1632	0.0787	0.5011	0.0587	0.5250

github-actions · 2025-07-29T08:02:10Z

test_simple_benchmark[engine_vLLM-False-zmq-False-False-/mnt/model/Qwen2.5-7B]

latency(ms)	p25	p50	p75	p95	p99	mean
decode	62.01	67.51	73.42	98.08	162.50	68.94
prefill	168.60	4770.14	18232.15	35240.11	55755.38	10831.87

test_simple_benchmark[engine_vLLM-False-rayqueue-False-False-/mnt/model/Qwen2.5-7B]

latency(ms)	p25	p50	p75	p95	p99	mean
decode	58.22	69.00	76.27	107.79	225.57	71.00
prefill	268.08	5894.85	19691.37	42098.05	59137.71	12104.35

github-actions · 2025-07-29T08:55:29Z

test_request_trace[rayqueue-engine_vLLM-/mnt/model/Qwen2.5-7B]

latency(ms)	mean	p50	p99	min	max
across_manager_latency	1.3540	1.3540	1.3540	1.3540	1.3540
across_llumlet_latency	1.0604	1.0604	1.0604	1.0604	1.0604
across_engine_latency	0.3436	0.3436	0.3436	0.3436	0.3436
process_model_outputs_latency	0.0808	0.0761	0.1041	0.0741	0.1048
engine_step_latency	34.1942	34.1723	34.5303	33.9361	34.5468
step_postprocess_latency	0.0200	0.0119	0.0855	0.0109	0.0926
across_async_put_queue_thread_latency	0.0124	0.0112	0.0218	0.0107	0.0228
across_async_put_queue_actor_latency	0.1829	0.1999	0.2094	0.0438	0.2095
across_queue_client_latency	0.0336	0.0334	0.0357	0.0322	0.0357
queue_rpc_latency	0.4158	0.4602	0.5599	0.2399	0.5603
api_server_get_queue_latency	0.1832	0.1471	0.3261	0.0914	0.3265
across_request_streams_latency	0.0968	0.0683	0.2205	0.0564	0.2239

test_request_trace[zmq-engine_vLLM-/mnt/model/Qwen2.5-7B]

latency(ms)	mean	p50	p99	min	max
across_manager_latency	1.3320	1.3320	1.3320	1.3320	1.3320
across_llumlet_latency	0.9918	0.9918	0.9918	0.9918	0.9918
across_engine_latency	0.3323	0.3323	0.3323	0.3323	0.3323
process_model_outputs_latency	0.1059	0.1016	0.1457	0.0855	0.1484
engine_step_latency	34.2355	34.1819	34.6641	33.9145	34.6893
step_postprocess_latency	0.0251	0.0126	0.1249	0.0120	0.1358
across_async_put_queue_thread_latency	0.0130	0.0127	0.0152	0.0120	0.0154
across_async_put_queue_actor_latency	0.2082	0.2069	0.2201	0.1979	0.2203
across_queue_client_latency	0.0482	0.0377	0.0999	0.0356	0.1016
queue_rpc_latency	0.3373	0.2928	0.5674	0.2682	0.5805
api_server_get_queue_latency	0.1383	0.1073	0.3623	0.0970	0.3850
across_request_streams_latency	0.0805	0.0687	0.1591	0.0579	0.1658

github-actions · 2025-07-29T09:37:33Z

test_simple_benchmark[engine_vLLM-False-zmq-False-False-/mnt/model/Qwen2.5-7B]

latency(ms)	p25	p50	p75	p95	p99	mean
decode	62.30	67.72	70.84	93.47	139.50	69.16
prefill	151.59	2845.35	18832.33	36176.36	40136.75	10410.02

test_simple_benchmark[engine_vLLM-False-rayqueue-False-False-/mnt/model/Qwen2.5-7B]

latency(ms)	p25	p50	p75	p95	p99	mean
decode	57.43	68.00	73.54	95.67	138.05	65.84
prefill	340.68	6365.50	19670.37	40144.11	50253.23	11934.69

github-actions · 2025-07-29T09:41:52Z

test_simple_benchmark[engine_BladeLLM-False-zmq-False-False-/mnt/model/Qwen2.5-7B]

latency(ms)	p25	p50	p75	p95	p99	mean
decode	56.15	63.87	71.24	88.51	127.27	64.85
prefill	170.98	239.94	459.12	1225.61	3528.13	438.22

test_simple_benchmark[engine_BladeLLM-True-zmq-False-False-/mnt/model/Qwen2.5-7B]

latency(ms)	p25	p50	p75	p95	p99	mean
decode	35.80	36.53	37.43	45.94	116.78	39.57
prefill	75.20	97.36	144.94	1668.87	2066.53	271.66

test_simple_benchmark[engine_BladeLLM-False-rayqueue-False-False-/mnt/model/Qwen2.5-7B]

latency(ms)	p25	p50	p75	p95	p99	mean
decode	56.06	64.67	72.65	84.03	137.21	67.77
prefill	182.41	270.73	559.64	1482.42	3238.30	498.91

test_simple_benchmark[engine_BladeLLM-False-zmq-True-False-/mnt/model/Qwen2.5-7B]

latency(ms)	p25	p50	p75	p95	p99	mean
decode	35.61	35.96	36.69	37.66	37.78	36.17
prefill	129.49	143.27	186.47	620.62	1015.73	233.41

test_simple_benchmark[engine_BladeLLM-False-zmq-True-True-/mnt/model/Qwen2.5-7B]

latency(ms)	p25	p50	p75	p95	p99	mean
decode	35.13	35.57	36.21	36.82	37.53	35.59
prefill	168.89	210.65	368.06	1293.10	3079.67	442.76

github-actions · 2025-07-29T11:45:52Z

test_request_trace[rayqueue-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms)	mean	p50	p99	min	max
across_manager_latency	1.1253	1.1253	1.1253	1.1253	1.1253
across_llumlet_latency	0.8647	0.8647	0.8647	0.8647	0.8647
across_engine_latency	0.1144	0.1144	0.1144	0.1144	0.1144
process_model_outputs_latency	0.4142	0.3949	0.5445	0.3889	0.5570
engine_step_latency	33.9499	33.7545	35.1480	33.6766	35.2496
step_postprocess_latency	0.0000	0.0000	0.0000	0.0000	0.0000
across_async_put_queue_thread_latency	0.0140	0.0138	0.0160	0.0134	0.0161
across_async_put_queue_actor_latency	0.0371	0.0375	0.0434	0.0328	0.0438
across_queue_client_latency	0.0348	0.0340	0.0434	0.0279	0.0437
queue_rpc_latency	0.2040	0.1945	0.2504	0.1806	0.2527
api_server_get_queue_latency	0.1074	0.1061	0.1234	0.0955	0.1239
across_request_streams_latency	0.0405	0.0315	0.1161	0.0266	0.1237

test_request_trace[zmq-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms)	mean	p50	p99	min	max
across_manager_latency	1.1066	1.1066	1.1066	1.1066	1.1066
across_llumlet_latency	0.8562	0.8562	0.8562	0.8562	0.8562
across_engine_latency	0.0982	0.0982	0.0982	0.0982	0.0982
process_model_outputs_latency	0.6242	0.4310	2.0109	0.4017	2.1470
engine_step_latency	34.0044	33.8200	35.2457	33.7254	35.3562
step_postprocess_latency	0.0000	0.0000	0.0000	0.0000	0.0000
across_async_put_queue_thread_latency	0.0172	0.0153	0.0298	0.0146	0.0309
across_async_put_queue_actor_latency	0.0427	0.0392	0.0654	0.0378	0.0674
across_queue_client_latency	0.0369	0.0362	0.0527	0.0278	0.0540
queue_rpc_latency	0.2037	0.1950	0.2469	0.1793	0.2484
api_server_get_queue_latency	0.1043	0.1013	0.1203	0.0938	0.1209
across_request_streams_latency	0.0386	0.0302	0.1049	0.0262	0.1116

github-actions · 2025-07-31T04:49:29Z

test_simple_benchmark[engine_vLLM-False-zmq-False-False-/mnt/model/Qwen2.5-7B]

latency(ms)	p25	p50	p75	p95	p99	mean
decode	61.24	67.05	72.39	92.54	135.68	67.19
prefill	172.81	4483.88	17684.44	43440.59	49127.17	10057.68

test_simple_benchmark[engine_vLLM-False-rayqueue-False-False-/mnt/model/Qwen2.5-7B]

latency(ms)	p25	p50	p75	p95	p99	mean
decode	57.60	68.45	76.25	104.40	166.37	68.98
prefill	342.97	5723.20	18617.89	43225.41	48430.32	10827.78

github-actions · 2025-07-31T04:52:41Z

test_request_trace[rayqueue-engine_vLLM-/mnt/model/Qwen2.5-7B]

latency(ms)	mean	p50	p99	min	max
across_manager_latency	1.4589	1.4589	1.4589	1.4589	1.4589
across_llumlet_latency	1.0083	1.0083	1.0083	1.0083	1.0083
across_engine_latency	0.3002	0.3002	0.3002	0.3002	0.3002
process_model_outputs_latency	0.0866	0.0808	0.1254	0.0755	0.1278
engine_step_latency	34.2241	34.1777	34.5893	33.9145	34.6088
step_postprocess_latency	0.0252	0.0121	0.1309	0.0114	0.1425
across_async_put_queue_thread_latency	0.0120	0.0115	0.0163	0.0111	0.0168
across_async_put_queue_actor_latency	0.1988	0.1983	0.2119	0.1860	0.2123
across_queue_client_latency	0.0343	0.0329	0.0442	0.0314	0.0451
queue_rpc_latency	0.2729	0.2667	0.3161	0.2578	0.3196
api_server_get_queue_latency	0.1013	0.0988	0.1151	0.0945	0.1153
across_request_streams_latency	0.0746	0.0592	0.1720	0.0564	0.1803

test_request_trace[zmq-engine_vLLM-/mnt/model/Qwen2.5-7B]

latency(ms)	mean	p50	p99	min	max
across_manager_latency	1.4931	1.4931	1.4931	1.4931	1.4931
across_llumlet_latency	1.0328	1.0328	1.0328	1.0328	1.0328
across_engine_latency	0.3102	0.3102	0.3102	0.3102	0.3102
process_model_outputs_latency	0.0844	0.0783	0.1166	0.0760	0.1181
engine_step_latency	34.1055	34.0860	34.3008	33.9027	34.3020
step_postprocess_latency	0.0254	0.0116	0.1313	0.0109	0.1422
across_async_put_queue_thread_latency	0.0112	0.0106	0.0160	0.0102	0.0164
across_async_put_queue_actor_latency	0.1935	0.1917	0.2078	0.1845	0.2079
across_queue_client_latency	0.0342	0.0330	0.0471	0.0302	0.0484
queue_rpc_latency	0.2757	0.2738	0.2968	0.2541	0.2971
api_server_get_queue_latency	0.1030	0.1033	0.1107	0.0963	0.1109
across_request_streams_latency	0.0757	0.0623	0.1612	0.0583	0.1689

github-actions · 2025-07-31T05:02:12Z

test_request_trace[rayqueue-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms)	mean	p50	p99	min	max
across_manager_latency	1.2414	1.2414	1.2414	1.2414	1.2414
across_llumlet_latency	0.9572	0.9572	0.9572	0.9572	0.9572
across_engine_latency	0.1224	0.1224	0.1224	0.1224	0.1224
process_model_outputs_latency	0.4486	0.4298	0.5901	0.4194	0.6039
engine_step_latency	34.0027	33.8230	35.1724	33.7554	35.2779
step_postprocess_latency	0.0000	0.0000	0.0000	0.0000	0.0000
across_async_put_queue_thread_latency	0.0163	0.0162	0.0194	0.0141	0.0197
across_async_put_queue_actor_latency	0.0609	0.0493	0.1619	0.0372	0.1722
across_queue_client_latency	0.0403	0.0401	0.0490	0.0289	0.0490
queue_rpc_latency	0.2922	0.2854	0.3348	0.2800	0.3382
api_server_get_queue_latency	0.1900	0.1882	0.2029	0.1845	0.2037
across_request_streams_latency	0.0636	0.0503	0.1694	0.0491	0.1807

test_request_trace[zmq-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms)	mean	p50	p99	min	max
across_manager_latency	1.0727	1.0727	1.0727	1.0727	1.0727
across_llumlet_latency	0.8455	0.8455	0.8455	0.8455	0.8455
across_engine_latency	0.0946	0.0946	0.0946	0.0946	0.0946
process_model_outputs_latency	0.4244	0.3946	0.5439	0.3806	0.5449
engine_step_latency	33.9755	33.8088	35.1759	33.7202	35.2857
step_postprocess_latency	0.0000	0.0000	0.0000	0.0000	0.0000
across_async_put_queue_thread_latency	0.0140	0.0140	0.0151	0.0133	0.0151
across_async_put_queue_actor_latency	0.0366	0.0364	0.0397	0.0339	0.0398
across_queue_client_latency	0.0249	0.0246	0.0263	0.0240	0.0263
queue_rpc_latency	0.2149	0.2158	0.2403	0.1842	0.2404
api_server_get_queue_latency	0.1102	0.1078	0.1236	0.0970	0.1241
across_request_streams_latency	0.0417	0.0321	0.1134	0.0289	0.1208

github-actions · 2025-07-31T05:37:12Z

test_simple_benchmark[engine_BladeLLM-False-zmq-False-False-/mnt/model/Qwen2.5-7B]

latency(ms)	p25	p50	p75	p95	p99	mean
decode	54.52	62.04	68.91	87.55	134.48	63.77
prefill	186.77	296.10	988.85	4019.20	5252.69	815.78

test_simple_benchmark[engine_BladeLLM-True-zmq-False-False-/mnt/model/Qwen2.5-7B]

latency(ms)	p25	p50	p75	p95	p99	mean
decode	35.81	36.40	37.51	107.46	133.20	41.70
prefill	45.60	49.16	85.06	710.26	1507.55	166.68

test_simple_benchmark[engine_BladeLLM-False-rayqueue-False-False-/mnt/model/Qwen2.5-7B]

latency(ms)	p25	p50	p75	p95	p99	mean
decode	56.25	65.30	72.96	88.10	128.76	66.57
prefill	188.73	299.83	618.45	1397.05	2434.60	484.91

test_simple_benchmark[engine_BladeLLM-False-zmq-True-False-/mnt/model/Qwen2.5-7B]

latency(ms)	p25	p50	p75	p95	p99	mean
decode	35.58	36.02	36.72	37.79	38.00	36.15
prefill	142.68	161.51	196.72	835.75	1099.80	243.63

test_simple_benchmark[engine_BladeLLM-False-zmq-True-True-/mnt/model/Qwen2.5-7B]

latency(ms)	p25	p50	p75	p95	p99	mean
decode	35.06	35.46	36.10	36.73	37.53	35.56
prefill	170.26	207.42	349.56	1271.26	3064.82	437.91

s5u13b reviewed Jul 28, 2025

sjrrr13 force-pushed the ep_migration_failover branch 2 times, most recently from bc47dc2 to db3b6e6 · July 29, 2025 04:41

KuilongCui commented Jul 29, 2025

llumnix/dp_manager.py (outdated)

sjrrr13 and others added 13 commits July 31, 2025 12:32

add UnitState, DPManager set it for itself and Llumlet; Modify DPManager heartbeat loop

f31a27c

move instance_info.UnitState to utils.UnitStatus; add log

a4c6dcc

add FailoverMigrationStatus but not finished

1766f88

add BrokenUnitFilter that used in DispatchScheduler; add unit test for it

76c6741

ep unit migration

205b3ff

fix migration polivy

8eb9939

fix lint:

a963cfc

fix comment

d8065e3

fix migration

c034648

fix

205a4e4

fix lint

1229239

add UnitStatus.TERMINATED; modify DPManager.stop using heartbeat loop

72b1f58

modify some naming

81ffed7

sjrrr13 added 6 commits July 31, 2025 12:35

fix tiny bugs

db34985

fix CI: vllm_correctness_test pass

b7a49d5

rebase main

b6df7ac

add UnitStatus.MIGRATING; return failover_migration_tasks; fix ci bugs

3897380

fix lint error

476c05c

fix manager bug

73c6ee3

sjrrr13 force-pushed the ep_migration_failover branch from 3899158 to 73c6ee3 · July 31, 2025 04:40
