Hi there,

I don't see any speed-up from the Visual Resolution Router (ViR) that is supposed to be enabled in the flash model. I set it up following the suggested route here:
```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(model_path, **model_kwargs).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
```
But I don't see any inference speed boost in my testing. Are there particular workloads where the speed-up is expected to show up? Is there a configuration flag I need to set to enable the ViR?
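For reference, the numbers below were collected with a timing loop roughly along these lines (a minimal sketch, not the exact harness; the `model(**x)` call and the structure of `inputs` are assumptions, so substitute the flash model's actual inference entry point):

```python
import time
import torch

def benchmark(model, inputs, batch_size=16, warmup=2, iters=5):
    """Time a batch of inference calls and report per-item latency.

    `inputs` is assumed to be a list of already-prepared input dicts;
    `model(**x)` stands in for whatever inference call the model exposes.
    """
    torch.cuda.reset_peak_memory_stats()
    with torch.inference_mode():
        # Warm-up passes so CUDA kernel compilation isn't timed.
        for _ in range(warmup):
            for x in inputs[:batch_size]:
                model(**x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            for x in inputs[:batch_size]:
                model(**x)
        torch.cuda.synchronize()
        elapsed = (time.perf_counter() - start) / iters  # seconds per batch

    per_item_ms = elapsed * 1000 / batch_size
    peak_mb = torch.cuda.max_memory_allocated() / 2**20
    print(f"Batch time: {elapsed * 1000:.2f} ms | "
          f"per item: {per_item_ms:.2f} ms | "
          f"throughput: {batch_size / elapsed:.2f} items/s | "
          f"peak mem: {peak_mb:.2f} MB")
```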
================================================================================
TEST 2: Batch Size 16
================================================================================
┌─────────────────────────────────────────┬──────────────────┬──────────────────┐
│ Metric │ Flash Model │ Standard Model │
├─────────────────────────────────────────┼──────────────────┼──────────────────┤
│ Batch Time (ms) │ 4500.43 │ 4478.83 │
│ Time per Item (ms) │ 900.09 │ 895.77 │
│ Throughput (items/sec) │ 1.11 │ 1.12 │
│ Peak Memory (MB) │ 2281.94 │ 2109.76 │
└─────────────────────────────────────────┴──────────────────┴──────────────────┘
Speedup (Flash vs Standard): 1.00x
Throughput Improvement: -0.5%
Memory Reduction: -8.2%
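In case it helps with diagnosis, one generic way to check whether the router is active is to dump the loaded config and look for router-related fields. This is a sketch that assumes nothing about the model's actual config schema; it only uses the standard `PretrainedConfig.to_dict()` API:

```python
# Grep the loaded config for anything router-related; the actual
# field names are unknown, so this just filters the config dump.
cfg = model.config.to_dict()
router_keys = {k: v for k, v in cfg.items()
               if "router" in k.lower() or "vir" in k.lower()}
print(router_keys or "no router-related keys found in model.config")
```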