[WIP][EP][Failover] Migrate out alive requests when part of ep unit is down #281

Open
wants to merge 19 commits into
base: main

Contributor

@KuilongCui KuilongCui commented Jul 24, 2025

  1. Failover migration (P0): ok

  2. FAILOVER_MIGRATING is not necessary: ok

  3. Maybe migrate concurrently

  4. Add a timeout for failover migration

  5. Replace random.choice with round-robin (rr)
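
For the last item in the list above, here is a minimal sketch of round-robin destination selection as an alternative to `random.choice`. The class name `RoundRobinPicker` and the instance ids are hypothetical and are not taken from this PR; the only point is that successive picks spread evenly and deterministically across the candidates.

```python
from typing import List


class RoundRobinPicker:
    """Cycle through candidate destination instance ids in a stable order.

    Hypothetical helper for illustration; not part of this PR.
    """

    def __init__(self) -> None:
        self._next_index = 0

    def pick(self, candidates: List[str]) -> str:
        if not candidates:
            raise ValueError("no candidate destination instances")
        # Sort so the rotation does not depend on the order the caller passes.
        ordered = sorted(candidates)
        chosen = ordered[self._next_index % len(ordered)]
        self._next_index += 1
        return chosen


picker = RoundRobinPicker()
dst_ids = ["instance_b", "instance_a", "instance_c"]
picks = [picker.pick(dst_ids) for _ in range(6)]
print(picks)
```

Unlike `random.choice`, two consecutive migrations never land on the same destination while more than one candidate is available, which may matter when several requests are migrated out of a failed unit at once.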

migration_policy: MigrationPolicy) -> List[Tuple[str, str]]:
src_instance_infos, dst_instance_infos = self.migration_base_filter.filter_instances(instance_info.values())
migration_policy: MigrationPolicy,
skip_broken_unit: bool = True) -> List[Tuple[str, str]]:
Contributor

why add skip_broken_unit?

Contributor

For migration with MigrationType == FAILOVER_MIGRATION, we cannot filter out BROKEN instances since they are the source of migration.

Contributor

@sjrrr13 sjrrr13 Jul 29, 2025

skip_broken_unit is deleted.
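
The concern in this thread can be sketched as follows: which instances qualify as migration sources depends on the migration type. This is only an illustration under assumptions. `MigrationType`, `FAILOVER_MIGRATION` and `BROKEN` appear in the discussion above; `LOAD_BALANCE`, `UnitStatus`, `InstanceInfo` and `select_source_instances` are hypothetical names, and the real filter in the PR may be structured differently.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Iterable, List


class MigrationType(Enum):
    LOAD_BALANCE = auto()        # hypothetical non-failover type
    FAILOVER_MIGRATION = auto()  # name taken from the review comment


class UnitStatus(Enum):
    HEALTHY = auto()  # hypothetical
    BROKEN = auto()   # name taken from the review comment


@dataclass
class InstanceInfo:
    instance_id: str
    unit_status: UnitStatus


def select_source_instances(
    instances: Iterable[InstanceInfo], migration_type: MigrationType
) -> List[InstanceInfo]:
    """Pick migration sources; BROKEN instances are sources only for failover."""
    if migration_type is MigrationType.FAILOVER_MIGRATION:
        # Failover migrates requests *out of* broken units, so keep them.
        return [i for i in instances if i.unit_status is UnitStatus.BROKEN]
    # Every other migration type skips broken units.
    return [i for i in instances if i.unit_status is UnitStatus.HEALTHY]


instances = [
    InstanceInfo("instance_0", UnitStatus.HEALTHY),
    InstanceInfo("instance_1", UnitStatus.BROKEN),
]
failover_srcs = [
    i.instance_id
    for i in select_source_instances(instances, MigrationType.FAILOVER_MIGRATION)
]
normal_srcs = [
    i.instance_id
    for i in select_source_instances(instances, MigrationType.LOAD_BALANCE)
]
print(failover_srcs, normal_srcs)
```

Keying the decision on the migration type, rather than on a separate boolean argument, is one way to end up without a `skip_broken_unit` parameter.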

except Exception as e:
log_instance_exception(e, dst_instance_id, "migrate_out", migrate_out_request.request_id)
migrated_request_list.extend(migrated_request)
if len(migrated_request) == 0 and migrate_out_request.eom:
Contributor

where is eom?

Contributor Author

while self.has_migration_slot() and (not migrate_out_request.eom):

Contributor

done

dst_instance_actor = self.instances[dst_instance_id]
asyncio.create_task(
    asyncio_wait_for_ray_remote_call_with_timeout(
        self.instances[src_instance_id].migrate_out,
        dst_instance_actor, dst_instance_id, migration_type
    )
)

if not exist_failover_migration_task:
Contributor

Why not put this logic in push migrations? If len(failover_migration_tasks) > 0, return an empty normal_migration_tasks.

Contributor

done
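
The suggestion in this thread, together with the timeout-wrapped dispatch shown in the diff, could look roughly like the sketch below. It uses plain `asyncio.wait_for` as a stand-in for the project's `asyncio_wait_for_ray_remote_call_with_timeout` helper, whose signature is not shown here. `select_migration_tasks`, `migrate_with_timeout` and `fake_migrate_out` are hypothetical names, and the 0.05 s timeout is an arbitrary value for the demo.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

MigrationPair = Tuple[str, str]  # (src_instance_id, dst_instance_id)


def select_migration_tasks(
    failover_migration_tasks: List[MigrationPair],
    normal_migration_tasks: List[MigrationPair],
) -> Tuple[List[MigrationPair], List[MigrationPair]]:
    """Return (failover, normal); normal is empty while failover work exists."""
    if len(failover_migration_tasks) > 0:
        return failover_migration_tasks, []
    return [], normal_migration_tasks


async def migrate_with_timeout(
    migrate_out: Callable[[str, str], Awaitable[None]],
    src: str,
    dst: str,
    timeout_s: float,
) -> bool:
    """Run one migration; report False instead of hanging if it times out."""
    try:
        await asyncio.wait_for(migrate_out(src, dst), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        return False


async def _demo() -> List[bool]:
    async def fake_migrate_out(src: str, dst: str) -> None:
        # Stand-in for a remote call; the "slow" source never finishes in time.
        await asyncio.sleep(1.0 if src == "slow" else 0.0)

    failover, normal = select_migration_tasks(
        [("fast", "dst_0"), ("slow", "dst_1")], [("a", "b")]
    )
    assert normal == []  # normal migrations are suppressed during failover
    outcomes = await asyncio.gather(
        *(migrate_with_timeout(fake_migrate_out, s, d, 0.05) for s, d in failover)
    )
    return list(outcomes)


results = asyncio.run(_demo())
print(results)
```

Returning an empty list of normal tasks keeps the caller's dispatch loop unchanged: it simply has nothing to dispatch for normal migration until all failover migrations have been handled.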

test_request_trace[rayqueue-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms) mean p50 p99 min max
across_manager_latency -3034053628.3221 -3034053628.3221 -3034053628.3221 -3034053628.3221 -3034053628.3221
across_llumlet_latency 3034053629.6314 3034053629.6314 3034053629.6314 3034053629.6314 3034053629.6314
across_engine_latency 0.1503 0.1503 0.1503 0.1503 0.1503
process_model_outputs_latency 0.4304 0.3999 0.5778 0.3760 0.5819
engine_step_latency 34.0110 33.8785 34.9004 33.7532 34.9672
step_postprocess_latency 0.0000 0.0000 0.0000 0.0000 0.0000
across_async_put_queue_thread_latency 0.0158 0.0146 0.0226 0.0132 0.0229
across_async_put_queue_actor_latency 0.0380 0.0364 0.0533 0.0329 0.0546
across_queue_client_latency 0.0332 0.0337 0.0443 0.0271 0.0450
queue_rpc_latency 0.2946 0.2773 0.4441 0.2608 0.4576
api_server_get_queue_latency 0.2038 0.1944 0.2737 0.1865 0.2793
across_request_streams_latency 0.0742 0.0556 0.1875 0.0517 0.1971

test_request_trace[zmq-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms) mean p50 p99 min max
across_manager_latency -3034193812.7334 -3034193812.7334 -3034193812.7334 -3034193812.7334 -3034193812.7334
across_llumlet_latency 3034193814.1497 3034193814.1497 3034193814.1497 3034193814.1497 3034193814.1497
across_engine_latency 0.1772 0.1772 0.1772 0.1772 0.1772
process_model_outputs_latency 0.4446 0.4095 0.5755 0.3886 0.5757
engine_step_latency 33.8888 33.7867 34.7845 33.7341 34.8798
step_postprocess_latency 0.0000 0.0000 0.0000 0.0000 0.0000
across_async_put_queue_thread_latency 0.0167 0.0153 0.0246 0.0142 0.0251
across_async_put_queue_actor_latency 0.0390 0.0358 0.0620 0.0338 0.0641
across_queue_client_latency 0.0332 0.0301 0.0521 0.0271 0.0536
queue_rpc_latency 0.2149 0.1925 0.3769 0.1747 0.3920
api_server_get_queue_latency 0.1122 0.1082 0.1623 0.0966 0.1668
across_request_streams_latency 0.0418 0.0316 0.1111 0.0271 0.1162

test_request_trace[rayqueue-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms) mean p50 p99 min max
across_manager_latency -3035159780.3036 -3035159780.3036 -3035159780.3036 -3035159780.3036 -3035159780.3036
across_llumlet_latency 3035159781.5923 3035159781.5923 3035159781.5923 3035159781.5923 3035159781.5923
across_engine_latency 0.1591 0.1591 0.1591 0.1591 0.1591
process_model_outputs_latency 0.4277 0.4055 0.5604 0.3815 0.5623
engine_step_latency 33.8438 33.7276 34.8042 33.6335 34.9053
step_postprocess_latency 0.0000 0.0000 0.0000 0.0000 0.0000
across_async_put_queue_thread_latency 0.0156 0.0146 0.0214 0.0137 0.0218
across_async_put_queue_actor_latency 0.0386 0.0370 0.0563 0.0326 0.0579
across_queue_client_latency 0.0357 0.0351 0.0523 0.0278 0.0539
queue_rpc_latency 0.2212 0.2074 0.3650 0.1799 0.3789
api_server_get_queue_latency 0.1128 0.1072 0.1628 0.0947 0.1669
across_request_streams_latency 0.0420 0.0316 0.1105 0.0275 0.1156

test_request_trace[zmq-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms) mean p50 p99 min max
across_manager_latency -3035296725.9502 -3035296725.9502 -3035296725.9502 -3035296725.9502 -3035296725.9502
across_llumlet_latency 3035296727.1860 3035296727.1860 3035296727.1860 3035296727.1860 3035296727.1860
across_engine_latency 0.1527 0.1527 0.1527 0.1527 0.1527
process_model_outputs_latency 0.4426 0.4181 0.5654 0.3890 0.5686
engine_step_latency 33.8631 33.7474 34.7569 33.7326 34.8497
step_postprocess_latency 0.0000 0.0000 0.0000 0.0000 0.0000
across_async_put_queue_thread_latency 0.0159 0.0148 0.0223 0.0140 0.0228
across_async_put_queue_actor_latency 0.0397 0.0375 0.0537 0.0347 0.0543
across_queue_client_latency 0.0362 0.0335 0.0507 0.0305 0.0511
queue_rpc_latency 0.2120 0.1942 0.3442 0.1768 0.3553
api_server_get_queue_latency 0.1116 0.1061 0.1604 0.0924 0.1638
across_request_streams_latency 0.0424 0.0314 0.1104 0.0281 0.1156

test_request_trace[rayqueue-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms) mean p50 p99 min max
across_manager_latency -3055926918.4997 -3055926918.4997 -3055926918.4997 -3055926918.4997 -3055926918.4997
across_llumlet_latency 3055926919.7197 3055926919.7197 3055926919.7197 3055926919.7197 3055926919.7197
across_engine_latency 0.1514 0.1514 0.1514 0.1514 0.1514
process_model_outputs_latency 0.4449 0.4158 0.5958 0.3902 0.5994
engine_step_latency 34.0164 33.9100 34.9387 33.7473 35.0316
step_postprocess_latency 0.0000 0.0000 0.0000 0.0000 0.0000
across_async_put_queue_thread_latency 0.0161 0.0151 0.0233 0.0142 0.0240
across_async_put_queue_actor_latency 0.0482 0.0447 0.0583 0.0423 0.0584
across_queue_client_latency 0.0555 0.0542 0.0631 0.0537 0.0637
queue_rpc_latency 0.2715 0.2545 0.4056 0.2384 0.4181
api_server_get_queue_latency 0.1136 0.1046 0.1651 0.0975 0.1688
across_request_streams_latency 0.0424 0.0290 0.1236 0.0270 0.1302

test_request_trace[zmq-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms) mean p50 p99 min max
across_manager_latency -3056063872.9079 -3056063872.9079 -3056063872.9079 -3056063872.9079 -3056063872.9079
across_llumlet_latency 3056063874.1515 3056063874.1515 3056063874.1515 3056063874.1515 3056063874.1515
across_engine_latency 0.1526 0.1526 0.1526 0.1526 0.1526
process_model_outputs_latency 0.4304 0.4045 0.5678 0.3795 0.5712
engine_step_latency 34.0194 33.9190 34.9072 33.7869 34.9954
step_postprocess_latency 0.0000 0.0000 0.0000 0.0000 0.0000
across_async_put_queue_thread_latency 0.0151 0.0144 0.0209 0.0134 0.0213
across_async_put_queue_actor_latency 0.0456 0.0432 0.0585 0.0412 0.0588
across_queue_client_latency 0.0564 0.0544 0.0695 0.0529 0.0703
queue_rpc_latency 0.2741 0.2513 0.4349 0.2348 0.4499
api_server_get_queue_latency 0.1162 0.1083 0.1863 0.0958 0.1923
across_request_streams_latency 0.0437 0.0318 0.1229 0.0269 0.1292

@sjrrr13 sjrrr13 force-pushed the ep_migration_failover branch 2 times, most recently from bc47dc2 to db3b6e6, on July 29, 2025 04:41

test_request_trace[rayqueue-engine_vLLM-/mnt/model/Qwen2.5-7B]

latency(ms) mean p50 p99 min max
across_manager_latency 1.8060 1.8060 1.8060 1.8060 1.8060
across_llumlet_latency 0.9921 0.9921 0.9921 0.9921 0.9921
across_engine_latency 0.3012 0.3012 0.3012 0.3012 0.3012
process_model_outputs_latency 0.0844 0.0808 0.1077 0.0763 0.1087
engine_step_latency 34.1148 34.1174 34.4124 33.8346 34.4337
step_postprocess_latency 0.0229 0.0120 0.1092 0.0113 0.1186
across_async_put_queue_thread_latency 0.0115 0.0116 0.0123 0.0107 0.0123
across_async_put_queue_actor_latency 0.1782 0.1912 0.2043 0.0494 0.2048
across_queue_client_latency 0.0348 0.0341 0.0423 0.0320 0.0429
queue_rpc_latency 0.2910 0.2879 0.3182 0.2728 0.3191
api_server_get_queue_latency 0.1076 0.1053 0.1208 0.1031 0.1215
across_request_streams_latency 0.0794 0.0647 0.1713 0.0628 0.1798

test_request_trace[zmq-engine_vLLM-/mnt/model/Qwen2.5-7B]

latency(ms) mean p50 p99 min max
across_manager_latency 1.7507 1.7507 1.7507 1.7507 1.7507
across_llumlet_latency 1.6250 1.6250 1.6250 1.6250 1.6250
across_engine_latency 0.3102 0.3102 0.3102 0.3102 0.3102
process_model_outputs_latency 0.0850 0.0778 0.1246 0.0734 0.1272
engine_step_latency 34.2608 34.2193 35.0039 33.9132 35.0704
step_postprocess_latency 0.0267 0.0126 0.1443 0.0107 0.1572
across_async_put_queue_thread_latency 0.0120 0.0117 0.0144 0.0110 0.0146
across_async_put_queue_actor_latency 0.1968 0.1962 0.2223 0.1769 0.2231
across_queue_client_latency 0.0343 0.0318 0.0441 0.0282 0.0445
queue_rpc_latency 0.2786 0.2748 0.3189 0.2574 0.3205
api_server_get_queue_latency 0.1025 0.1021 0.1093 0.0970 0.1093
across_request_streams_latency 0.0759 0.0649 0.1633 0.0581 0.1708

test_simple_benchmark[engine_BladeLLM-False-zmq-False-False-/mnt/model/Qwen2.5-7B]

latency(ms) p25 p50 p75 p95 p99 mean
decode 56.51 64.60 73.29 90.35 122.94 66.47
prefill 166.55 245.46 498.04 1984.95 3784.55 521.08

test_simple_benchmark[engine_BladeLLM-True-zmq-False-False-/mnt/model/Qwen2.5-7B]

latency(ms) p25 p50 p75 p95 p99 mean
decode 35.95 36.27 37.24 65.45 120.72 40.27
prefill 49.41 79.79 114.47 680.72 1890.38 185.53

test_simple_benchmark[engine_BladeLLM-False-rayqueue-False-False-/mnt/model/Qwen2.5-7B]

latency(ms) p25 p50 p75 p95 p99 mean
decode 55.63 63.71 71.76 90.07 132.40 65.31
prefill 188.87 277.12 708.43 1834.10 4315.46 589.60

test_simple_benchmark[engine_BladeLLM-False-zmq-True-False-/mnt/model/Qwen2.5-7B]

latency(ms) p25 p50 p75 p95 p99 mean
decode 35.61 36.01 36.85 37.73 37.98 36.20
prefill 134.92 151.98 199.46 828.04 1574.44 252.86

test_simple_benchmark[engine_BladeLLM-False-zmq-True-True-/mnt/model/Qwen2.5-7B]

latency(ms) p25 p50 p75 p95 p99 mean
decode 35.13 35.53 35.84 38.00 38.34 35.67
prefill 144.34 184.69 345.82 1242.56 3003.89 408.95

test_simple_benchmark[engine_vLLM-False-zmq-False-False-/mnt/model/Qwen2.5-7B]

latency(ms) p25 p50 p75 p95 p99 mean
decode 60.91 68.67 75.49 103.61 201.16 72.70
prefill 195.95 1433.19 16908.96 36904.28 49573.28 9761.00

test_simple_benchmark[engine_vLLM-False-rayqueue-False-False-/mnt/model/Qwen2.5-7B]

latency(ms) p25 p50 p75 p95 p99 mean
decode 57.12 68.75 74.47 95.62 146.90 67.00
prefill 334.81 5977.67 20329.74 38403.65 42062.05 11286.93

test_request_trace[rayqueue-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms) mean p50 p99 min max
across_manager_latency 1.0824 1.0824 1.0824 1.0824 1.0824
across_llumlet_latency 0.8978 0.8978 0.8978 0.8978 0.8978
across_engine_latency 0.1143 0.1143 0.1143 0.1143 0.1143
process_model_outputs_latency 0.4208 0.4066 0.5435 0.3894 0.5548
engine_step_latency 34.0240 33.8174 35.4233 33.7522 35.5450
step_postprocess_latency 0.0000 0.0000 0.0000 0.0000 0.0000
across_async_put_queue_thread_latency 0.0147 0.0144 0.0171 0.0137 0.0172
across_async_put_queue_actor_latency 0.0345 0.0336 0.0417 0.0312 0.0422
across_queue_client_latency 0.0294 0.0289 0.0356 0.0247 0.0360
queue_rpc_latency 0.2013 0.1953 0.2453 0.1825 0.2479
api_server_get_queue_latency 0.1092 0.1079 0.1291 0.0989 0.1303
across_request_streams_latency 0.0415 0.0318 0.1192 0.0275 0.1270

test_request_trace[zmq-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms) mean p50 p99 min max
across_manager_latency 1.0955 1.0955 1.0955 1.0955 1.0955
across_llumlet_latency 0.8261 0.8261 0.8261 0.8261 0.8261
across_engine_latency 0.1053 0.1053 0.1053 0.1053 0.1053
process_model_outputs_latency 0.4207 0.4098 0.5430 0.3915 0.5551
engine_step_latency 34.0783 33.8866 35.3813 33.7608 35.4922
step_postprocess_latency 0.0000 0.0000 0.0000 0.0000 0.0000
across_async_put_queue_thread_latency 0.0146 0.0147 0.0160 0.0136 0.0161
across_async_put_queue_actor_latency 0.0371 0.0368 0.0465 0.0325 0.0471
across_queue_client_latency 0.0311 0.0315 0.0354 0.0273 0.0355
queue_rpc_latency 0.1945 0.1939 0.2215 0.1741 0.2218
api_server_get_queue_latency 0.1069 0.1034 0.1300 0.0966 0.1315
across_request_streams_latency 0.0403 0.0305 0.1118 0.0281 0.1190

test_request_trace[rayqueue-engine_vLLM-/mnt/model/Qwen2.5-7B]

latency(ms) mean p50 p99 min max
across_manager_latency 1.5590 1.5590 1.5590 1.5590 1.5590
across_llumlet_latency 1.0896 1.0896 1.0896 1.0896 1.0896
across_engine_latency 0.3316 0.3316 0.3316 0.3316 0.3316
process_model_outputs_latency 0.0886 0.0838 0.1096 0.0786 0.1099
engine_step_latency 34.1213 34.1293 34.2591 33.8835 34.2592
step_postprocess_latency 0.0205 0.0117 0.0907 0.0113 0.0983
across_async_put_queue_thread_latency 0.0112 0.0113 0.0116 0.0105 0.0116
across_async_put_queue_actor_latency 0.1819 0.1969 0.2123 0.0419 0.2133
across_queue_client_latency 0.0375 0.0317 0.0842 0.0284 0.0889
queue_rpc_latency 0.2699 0.2646 0.3250 0.2452 0.3286
api_server_get_queue_latency 0.1044 0.1027 0.1181 0.0958 0.1189
across_request_streams_latency 0.0769 0.0644 0.1611 0.0575 0.1680

test_request_trace[zmq-engine_vLLM-/mnt/model/Qwen2.5-7B]

latency(ms) mean p50 p99 min max
across_manager_latency 1.3815 1.3815 1.3815 1.3815 1.3815
across_llumlet_latency 1.0345 1.0345 1.0345 1.0345 1.0345
across_engine_latency 0.2964 0.2964 0.2964 0.2964 0.2964
process_model_outputs_latency 0.1029 0.0993 0.1235 0.0847 0.1237
engine_step_latency 34.1323 34.1459 34.3852 33.8193 34.3901
step_postprocess_latency 0.0250 0.0131 0.1150 0.0119 0.1244
across_async_put_queue_thread_latency 0.0128 0.0117 0.0217 0.0109 0.0226
across_async_put_queue_actor_latency 0.1504 0.1917 0.2177 0.0383 0.2193
across_queue_client_latency 0.0298 0.0296 0.0338 0.0255 0.0338
queue_rpc_latency 0.4182 0.4551 0.5910 0.2544 0.5958
api_server_get_queue_latency 0.2156 0.1875 0.3554 0.0965 0.3566
across_request_streams_latency 0.1632 0.0787 0.5011 0.0587 0.5250

test_simple_benchmark[engine_vLLM-False-zmq-False-False-/mnt/model/Qwen2.5-7B]

latency(ms) p25 p50 p75 p95 p99 mean
decode 62.01 67.51 73.42 98.08 162.50 68.94
prefill 168.60 4770.14 18232.15 35240.11 55755.38 10831.87

test_simple_benchmark[engine_vLLM-False-rayqueue-False-False-/mnt/model/Qwen2.5-7B]

latency(ms) p25 p50 p75 p95 p99 mean
decode 58.22 69.00 76.27 107.79 225.57 71.00
prefill 268.08 5894.85 19691.37 42098.05 59137.71 12104.35

test_request_trace[rayqueue-engine_vLLM-/mnt/model/Qwen2.5-7B]

latency(ms) mean p50 p99 min max
across_manager_latency 1.3540 1.3540 1.3540 1.3540 1.3540
across_llumlet_latency 1.0604 1.0604 1.0604 1.0604 1.0604
across_engine_latency 0.3436 0.3436 0.3436 0.3436 0.3436
process_model_outputs_latency 0.0808 0.0761 0.1041 0.0741 0.1048
engine_step_latency 34.1942 34.1723 34.5303 33.9361 34.5468
step_postprocess_latency 0.0200 0.0119 0.0855 0.0109 0.0926
across_async_put_queue_thread_latency 0.0124 0.0112 0.0218 0.0107 0.0228
across_async_put_queue_actor_latency 0.1829 0.1999 0.2094 0.0438 0.2095
across_queue_client_latency 0.0336 0.0334 0.0357 0.0322 0.0357
queue_rpc_latency 0.4158 0.4602 0.5599 0.2399 0.5603
api_server_get_queue_latency 0.1832 0.1471 0.3261 0.0914 0.3265
across_request_streams_latency 0.0968 0.0683 0.2205 0.0564 0.2239

test_request_trace[zmq-engine_vLLM-/mnt/model/Qwen2.5-7B]

latency(ms) mean p50 p99 min max
across_manager_latency 1.3320 1.3320 1.3320 1.3320 1.3320
across_llumlet_latency 0.9918 0.9918 0.9918 0.9918 0.9918
across_engine_latency 0.3323 0.3323 0.3323 0.3323 0.3323
process_model_outputs_latency 0.1059 0.1016 0.1457 0.0855 0.1484
engine_step_latency 34.2355 34.1819 34.6641 33.9145 34.6893
step_postprocess_latency 0.0251 0.0126 0.1249 0.0120 0.1358
across_async_put_queue_thread_latency 0.0130 0.0127 0.0152 0.0120 0.0154
across_async_put_queue_actor_latency 0.2082 0.2069 0.2201 0.1979 0.2203
across_queue_client_latency 0.0482 0.0377 0.0999 0.0356 0.1016
queue_rpc_latency 0.3373 0.2928 0.5674 0.2682 0.5805
api_server_get_queue_latency 0.1383 0.1073 0.3623 0.0970 0.3850
across_request_streams_latency 0.0805 0.0687 0.1591 0.0579 0.1658

test_simple_benchmark[engine_vLLM-False-zmq-False-False-/mnt/model/Qwen2.5-7B]

latency(ms) p25 p50 p75 p95 p99 mean
decode 62.30 67.72 70.84 93.47 139.50 69.16
prefill 151.59 2845.35 18832.33 36176.36 40136.75 10410.02

test_simple_benchmark[engine_vLLM-False-rayqueue-False-False-/mnt/model/Qwen2.5-7B]

latency(ms) p25 p50 p75 p95 p99 mean
decode 57.43 68.00 73.54 95.67 138.05 65.84
prefill 340.68 6365.50 19670.37 40144.11 50253.23 11934.69

test_simple_benchmark[engine_BladeLLM-False-zmq-False-False-/mnt/model/Qwen2.5-7B]

latency(ms) p25 p50 p75 p95 p99 mean
decode 56.15 63.87 71.24 88.51 127.27 64.85
prefill 170.98 239.94 459.12 1225.61 3528.13 438.22

test_simple_benchmark[engine_BladeLLM-True-zmq-False-False-/mnt/model/Qwen2.5-7B]

latency(ms) p25 p50 p75 p95 p99 mean
decode 35.80 36.53 37.43 45.94 116.78 39.57
prefill 75.20 97.36 144.94 1668.87 2066.53 271.66

test_simple_benchmark[engine_BladeLLM-False-rayqueue-False-False-/mnt/model/Qwen2.5-7B]

latency(ms) p25 p50 p75 p95 p99 mean
decode 56.06 64.67 72.65 84.03 137.21 67.77
prefill 182.41 270.73 559.64 1482.42 3238.30 498.91

test_simple_benchmark[engine_BladeLLM-False-zmq-True-False-/mnt/model/Qwen2.5-7B]

latency(ms) p25 p50 p75 p95 p99 mean
decode 35.61 35.96 36.69 37.66 37.78 36.17
prefill 129.49 143.27 186.47 620.62 1015.73 233.41

test_simple_benchmark[engine_BladeLLM-False-zmq-True-True-/mnt/model/Qwen2.5-7B]

latency(ms) p25 p50 p75 p95 p99 mean
decode 35.13 35.57 36.21 36.82 37.53 35.59
prefill 168.89 210.65 368.06 1293.10 3079.67 442.76

test_request_trace[rayqueue-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms) mean p50 p99 min max
across_manager_latency 1.1253 1.1253 1.1253 1.1253 1.1253
across_llumlet_latency 0.8647 0.8647 0.8647 0.8647 0.8647
across_engine_latency 0.1144 0.1144 0.1144 0.1144 0.1144
process_model_outputs_latency 0.4142 0.3949 0.5445 0.3889 0.5570
engine_step_latency 33.9499 33.7545 35.1480 33.6766 35.2496
step_postprocess_latency 0.0000 0.0000 0.0000 0.0000 0.0000
across_async_put_queue_thread_latency 0.0140 0.0138 0.0160 0.0134 0.0161
across_async_put_queue_actor_latency 0.0371 0.0375 0.0434 0.0328 0.0438
across_queue_client_latency 0.0348 0.0340 0.0434 0.0279 0.0437
queue_rpc_latency 0.2040 0.1945 0.2504 0.1806 0.2527
api_server_get_queue_latency 0.1074 0.1061 0.1234 0.0955 0.1239
across_request_streams_latency 0.0405 0.0315 0.1161 0.0266 0.1237

test_request_trace[zmq-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms) mean p50 p99 min max
across_manager_latency 1.1066 1.1066 1.1066 1.1066 1.1066
across_llumlet_latency 0.8562 0.8562 0.8562 0.8562 0.8562
across_engine_latency 0.0982 0.0982 0.0982 0.0982 0.0982
process_model_outputs_latency 0.6242 0.4310 2.0109 0.4017 2.1470
engine_step_latency 34.0044 33.8200 35.2457 33.7254 35.3562
step_postprocess_latency 0.0000 0.0000 0.0000 0.0000 0.0000
across_async_put_queue_thread_latency 0.0172 0.0153 0.0298 0.0146 0.0309
across_async_put_queue_actor_latency 0.0427 0.0392 0.0654 0.0378 0.0674
across_queue_client_latency 0.0369 0.0362 0.0527 0.0278 0.0540
queue_rpc_latency 0.2037 0.1950 0.2469 0.1793 0.2484
api_server_get_queue_latency 0.1043 0.1013 0.1203 0.0938 0.1209
across_request_streams_latency 0.0386 0.0302 0.1049 0.0262 0.1116

@sjrrr13 sjrrr13 force-pushed the ep_migration_failover branch from 3899158 to 73c6ee3 on July 31, 2025 04:40

test_simple_benchmark[engine_vLLM-False-zmq-False-False-/mnt/model/Qwen2.5-7B]

latency(ms) p25 p50 p75 p95 p99 mean
decode 61.24 67.05 72.39 92.54 135.68 67.19
prefill 172.81 4483.88 17684.44 43440.59 49127.17 10057.68

test_simple_benchmark[engine_vLLM-False-rayqueue-False-False-/mnt/model/Qwen2.5-7B]

latency(ms) p25 p50 p75 p95 p99 mean
decode 57.60 68.45 76.25 104.40 166.37 68.98
prefill 342.97 5723.20 18617.89 43225.41 48430.32 10827.78

test_request_trace[rayqueue-engine_vLLM-/mnt/model/Qwen2.5-7B]

latency(ms) mean p50 p99 min max
across_manager_latency 1.4589 1.4589 1.4589 1.4589 1.4589
across_llumlet_latency 1.0083 1.0083 1.0083 1.0083 1.0083
across_engine_latency 0.3002 0.3002 0.3002 0.3002 0.3002
process_model_outputs_latency 0.0866 0.0808 0.1254 0.0755 0.1278
engine_step_latency 34.2241 34.1777 34.5893 33.9145 34.6088
step_postprocess_latency 0.0252 0.0121 0.1309 0.0114 0.1425
across_async_put_queue_thread_latency 0.0120 0.0115 0.0163 0.0111 0.0168
across_async_put_queue_actor_latency 0.1988 0.1983 0.2119 0.1860 0.2123
across_queue_client_latency 0.0343 0.0329 0.0442 0.0314 0.0451
queue_rpc_latency 0.2729 0.2667 0.3161 0.2578 0.3196
api_server_get_queue_latency 0.1013 0.0988 0.1151 0.0945 0.1153
across_request_streams_latency 0.0746 0.0592 0.1720 0.0564 0.1803

test_request_trace[zmq-engine_vLLM-/mnt/model/Qwen2.5-7B]

latency(ms) mean p50 p99 min max
across_manager_latency 1.4931 1.4931 1.4931 1.4931 1.4931
across_llumlet_latency 1.0328 1.0328 1.0328 1.0328 1.0328
across_engine_latency 0.3102 0.3102 0.3102 0.3102 0.3102
process_model_outputs_latency 0.0844 0.0783 0.1166 0.0760 0.1181
engine_step_latency 34.1055 34.0860 34.3008 33.9027 34.3020
step_postprocess_latency 0.0254 0.0116 0.1313 0.0109 0.1422
across_async_put_queue_thread_latency 0.0112 0.0106 0.0160 0.0102 0.0164
across_async_put_queue_actor_latency 0.1935 0.1917 0.2078 0.1845 0.2079
across_queue_client_latency 0.0342 0.0330 0.0471 0.0302 0.0484
queue_rpc_latency 0.2757 0.2738 0.2968 0.2541 0.2971
api_server_get_queue_latency 0.1030 0.1033 0.1107 0.0963 0.1109
across_request_streams_latency 0.0757 0.0623 0.1612 0.0583 0.1689

test_request_trace[rayqueue-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms) mean p50 p99 min max
across_manager_latency 1.2414 1.2414 1.2414 1.2414 1.2414
across_llumlet_latency 0.9572 0.9572 0.9572 0.9572 0.9572
across_engine_latency 0.1224 0.1224 0.1224 0.1224 0.1224
process_model_outputs_latency 0.4486 0.4298 0.5901 0.4194 0.6039
engine_step_latency 34.0027 33.8230 35.1724 33.7554 35.2779
step_postprocess_latency 0.0000 0.0000 0.0000 0.0000 0.0000
across_async_put_queue_thread_latency 0.0163 0.0162 0.0194 0.0141 0.0197
across_async_put_queue_actor_latency 0.0609 0.0493 0.1619 0.0372 0.1722
across_queue_client_latency 0.0403 0.0401 0.0490 0.0289 0.0490
queue_rpc_latency 0.2922 0.2854 0.3348 0.2800 0.3382
api_server_get_queue_latency 0.1900 0.1882 0.2029 0.1845 0.2037
across_request_streams_latency 0.0636 0.0503 0.1694 0.0491 0.1807

test_request_trace[zmq-engine_BladeLLM-/mnt/model/Qwen2.5-7B]

latency(ms) mean p50 p99 min max
across_manager_latency 1.0727 1.0727 1.0727 1.0727 1.0727
across_llumlet_latency 0.8455 0.8455 0.8455 0.8455 0.8455
across_engine_latency 0.0946 0.0946 0.0946 0.0946 0.0946
process_model_outputs_latency 0.4244 0.3946 0.5439 0.3806 0.5449
engine_step_latency 33.9755 33.8088 35.1759 33.7202 35.2857
step_postprocess_latency 0.0000 0.0000 0.0000 0.0000 0.0000
across_async_put_queue_thread_latency 0.0140 0.0140 0.0151 0.0133 0.0151
across_async_put_queue_actor_latency 0.0366 0.0364 0.0397 0.0339 0.0398
across_queue_client_latency 0.0249 0.0246 0.0263 0.0240 0.0263
queue_rpc_latency 0.2149 0.2158 0.2403 0.1842 0.2404
api_server_get_queue_latency 0.1102 0.1078 0.1236 0.0970 0.1241
across_request_streams_latency 0.0417 0.0321 0.1134 0.0289 0.1208

test_simple_benchmark[engine_BladeLLM-False-zmq-False-False-/mnt/model/Qwen2.5-7B]

latency(ms) p25 p50 p75 p95 p99 mean
decode 54.52 62.04 68.91 87.55 134.48 63.77
prefill 186.77 296.10 988.85 4019.20 5252.69 815.78

test_simple_benchmark[engine_BladeLLM-True-zmq-False-False-/mnt/model/Qwen2.5-7B]

latency(ms) p25 p50 p75 p95 p99 mean
decode 35.81 36.40 37.51 107.46 133.20 41.70
prefill 45.60 49.16 85.06 710.26 1507.55 166.68

test_simple_benchmark[engine_BladeLLM-False-rayqueue-False-False-/mnt/model/Qwen2.5-7B]

latency(ms) p25 p50 p75 p95 p99 mean
decode 56.25 65.30 72.96 88.10 128.76 66.57
prefill 188.73 299.83 618.45 1397.05 2434.60 484.91

test_simple_benchmark[engine_BladeLLM-False-zmq-True-False-/mnt/model/Qwen2.5-7B]

latency(ms) p25 p50 p75 p95 p99 mean
decode 35.58 36.02 36.72 37.79 38.00 36.15
prefill 142.68 161.51 196.72 835.75 1099.80 243.63

test_simple_benchmark[engine_BladeLLM-False-zmq-True-True-/mnt/model/Qwen2.5-7B]

latency(ms) p25 p50 p75 p95 p99 mean
decode 35.06 35.46 36.10 36.73 37.53 35.56
prefill 170.26 207.42 349.56 1271.26 3064.82 437.91

3 participants