The tokenized input request is split into blocks, and each block's hash value is cached for future matching. The block size (i.e., the number of tokens per block) determines how effective prefix matching is. Default is <ins>**_character tokenizer and 128 block size (tokens per block)_**</ins>.
| Tokenizer Type | Block Size Recommendation |
| ------------- | ------------- |
| character | 128 |
| tiktoken | 16 |
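As a rough sketch of the block-hashing idea above: the token sequence is cut into fixed-size blocks, and each block's hash (chained with the previous block's hash so that a key identifies the whole prefix) serves as a cache key. The function name, the chaining scheme, and the use of SHA-256 are illustrative assumptions, not AIBrix's actual implementation.

```python
import hashlib

def prefix_block_hashes(tokens, block_size=128):
    # Hypothetical helper: hash each full block, chaining in the previous
    # block's hash so each key represents the entire prefix up to that block.
    hashes = []
    prev = ""
    full_len = len(tokens) - len(tokens) % block_size  # trailing partial block ignored
    for i in range(0, full_len, block_size):
        block = tokens[i:i + block_size]
        prev = hashlib.sha256((prev + "".join(map(str, block))).encode()).hexdigest()
        hashes.append(prev)
    return hashes

# A 300-token prompt with block_size=128 yields 2 full-block keys.
print(len(prefix_block_hashes(list(range(300)), block_size=128)))  # 2
```

A smaller block size (e.g. 16 for tiktoken) produces more, finer-grained keys, so partial prefix overlaps are more likely to hit the cache, at the cost of more entries.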
- **AIBRIX_PREFIX_CACHE_BLOCK_NUMBER**
Maximum number of prefix cache blocks. Default is <ins>**_200000_**</ins>.
Before evaluating the prefix cache match, the router checks whether running requests are imbalanced across pods. Imbalance is measured as the absolute difference between the maximum and minimum running-request counts across pods. For example, if imbalance_abs_count = 16 and the running requests per pod are [p1: 1, p2: 2, p3: 20], the scenario is flagged as imbalanced. When flagged as imbalanced, the prefix match is ignored and the request is routed to the pod with the fewest running requests, which in the above example is pod p1. Default is <ins>**_16_**</ins> and should be adjusted based on GPU hardware and prompt length.
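The imbalance rule above can be sketched as follows. The function name and the strict `>` comparison are assumptions for illustration; only the max-minus-min threshold logic comes from the text.

```python
def route_when_imbalanced(running, imbalance_abs_count=16):
    # If max - min running requests exceeds the threshold, skip prefix
    # matching and return the least-loaded pod; otherwise return None to
    # fall through to prefix-cache routing.
    if max(running.values()) - min(running.values()) > imbalance_abs_count:
        return min(running, key=running.get)
    return None

pods = {"p1": 1, "p2": 2, "p3": 20}
print(route_when_imbalanced(pods))  # p1 (20 - 1 = 19 > 16, so imbalanced)
```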
After evaluating the prefix match, pods with a matching prefix cache are selected. The selected pods are re-evaluated to prevent a hotspot scenario where the bulk of prefix-matching requests are routed to the same pod. Imbalance is checked as follows:
<pre>
prefix_match_pod.running_requests <= mean + <b>load_factor</b> * standard_deviation
</pre>
**load_factor** determines the number of standard deviations. Default is <ins>**_2_**</ins>.
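A minimal sketch of this standard-deviation filter, assuming the mean and deviation are taken over the running-request counts of all pods (the function name is hypothetical):

```python
import statistics

def eligible_for_prefix_match(pod_running, all_running, load_factor=2.0):
    # A prefix-matched pod stays eligible only if its running-request
    # count is within load_factor standard deviations above the mean.
    mean = statistics.mean(all_running)
    std = statistics.pstdev(all_running)
    return pod_running <= mean + load_factor * std

counts = [1, 2, 20]
# mean ~7.67, pstdev ~8.73, threshold ~25.1, so a pod with 20 running
# requests still qualifies at load_factor=2.
print(eligible_for_prefix_match(20, counts, load_factor=2.0))  # True
```

Lowering **load_factor** makes the hotspot check stricter: at load_factor = 0 the same pod would be rejected, since 20 exceeds the mean of about 7.67.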