In the streaming case, we cannot clamp the generated tokens and recompute them.
Moreover, since the clamping logic is done in the worker but not in the main process, a discrepancy arises between the main process and the worker process. See #158 and #164.
We need to either
- Require that generation never grows beyond `max_num_batched_tokens`, or
- Split the recovery of such requests into multiple batches, using the new `evaluate_multi_query` function from #156 (Add new Relax function to the batched model for evaluating query tokens over multiple time steps in parallel); see the sketch after this list.
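A minimal sketch of the second option, just to make the chunked recovery concrete. The names here are assumptions for illustration: `evaluate_multi_query` stands in for the Relax function proposed in #156 (its actual signature in the engine may differ), and the token budget is passed in explicitly as `max_num_batched_tokens`.

```python
# Illustrative sketch only, not the actual engine API.
# Recover a request whose token history exceeds max_num_batched_tokens by
# replaying it in bounded chunks rather than clamping the generated tokens
# and recomputing them.

from typing import Callable, List, Sequence


def recover_in_chunks(
    token_ids: Sequence[int],
    max_num_batched_tokens: int,
    evaluate_multi_query: Callable[[List[int], int], None],
) -> None:
    """Replay `token_ids` (prompt plus already generated tokens) through the model,
    never submitting more than `max_num_batched_tokens` tokens in one batch.

    `evaluate_multi_query(chunk, start_pos)` is assumed to run the batched model
    over `chunk` starting at position `start_pos`, filling the KV cache for those
    positions in parallel (the role of the function added in #156).
    """
    for start in range(0, len(token_ids), max_num_batched_tokens):
        chunk = list(token_ids[start : start + max_num_batched_tokens])
        evaluate_multi_query(chunk, start)
```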