[Serving] Support NVTX for benchmarking #2043

Merged
merged 1 commit into mlc-ai:main on Mar 28, 2024

Conversation

MasterJH5574
Member

This PR supports MLC serve with NVTX, which helps analyze benchmarking results.

Note. To enable NVTX, please add `set(USE_NVTX ON)` to the file `build/config.cmake`.

MasterJH5574 marked this pull request as ready for review March 28, 2024 02:25
@tqchen tqchen merged commit 4255a45 into mlc-ai:main Mar 28, 2024
2 checks passed
sunggg added a commit to octoml/mlc-llm that referenced this pull request Apr 22, 2024
… April 18th 2024) (#260)

* [Attn] Making decode attn kernel be aware of webgpu target (#1817)

This PR enables the decode attn kernel to have awareness of
the webgpu backend, so that it helps make sure the total
number of threads does not exceed the 256 limit of WebGPU.

Co-authored-by: Bohan Hou <[email protected]>

* [Serving][Refactor] Logit processor and logit bias support (#1828)

This PR refactors the existing logit processing pipeline
with a unified logit processor class. The logit processor class
exposes two functions:
- `InplaceUpdateLogits`, which takes in the raw logits produced
by the model and applies logit bias (introduced in this PR),
presence/frequency/repetition penalties, and the token id mask,
in that order, when needed.
- `ComputeProbsFromLogits`, which takes in the updated logits,
and invoke softmax with temperature to compute the probability
distribution.

The logit processor runs entirely on GPU. That is,
all the logit bias / penalty / mask application and the softmax
are backed by GPU kernels. This is a key difference from
the logit processing prior to this PR, where the processing
happened on CPU, and softmax also happened on CPU whenever any
logit processing was needed.

With the unified logit processor, we simplified the interface
of handling model's output logits in engine actions to make it
cleaner. We also simplified the interface of Sampler.

Preliminary results show that LogitProcessor brings a slight perf
improvement when any processing is needed.
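As a rough illustration, a CPU/NumPy sketch of the two functions follows; the names mirror the interface described above, but the actual implementation runs as GPU kernels and its signatures may differ:

```python
import numpy as np

def inplace_update_logits(logits, logit_bias=None, presence_penalty=0.0,
                          frequency_penalty=0.0, token_counts=None):
    # Apply logit bias first, then presence/frequency penalties,
    # matching the order described above.
    if logit_bias:
        for token_id, bias in logit_bias.items():
            logits[token_id] += bias
    if token_counts:
        for token_id, count in token_counts.items():
            logits[token_id] -= presence_penalty + frequency_penalty * count
    return logits

def compute_probs_from_logits(logits, temperature=1.0):
    # Softmax with temperature, numerically stabilized.
    scaled = logits / max(temperature, 1e-5)
    scaled = scaled - scaled.max()
    exp = np.exp(scaled)
    return exp / exp.sum()
```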

* [Serving][Grammar] BNF grammar simplifier and matcher (#1801)

* [Serving] LogProbs support (#1832)

This PR introduces the logprobs support with OpenAI API
compatibility. It enhances the sampler with a function to get
the top-probability tokens (supporting 5 tokens at most as of now).

To make it easy to pass logprob results back from serving engine
to frontend, we choose to pass logprob results in JSON string with
OpenAI API spec.

Unit tests are added to ensure the correctness of logprobs.
And the logprobs support also works with speculative decoding.
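For intuition, a minimal sketch of picking the top-probability tokens and packing them as a JSON string; the helper name and JSON layout here are illustrative, not the engine's actual API:

```python
import json
import math

def top_logprobs_json(probs, top=5):
    # The sampler supports at most 5 top tokens as of this PR.
    top = min(top, 5)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top]
    entries = [{"token_id": i, "logprob": math.log(probs[i])} for i in ranked]
    return json.dumps({"top_logprobs": entries})
```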

* [Serving] Support Mixtral in MLC Serve (#1840)

This PR supports Mixtral in MLC serve. The main change is
registering the Mistral conversation template in the Python registry
so that MLC Serve can use it.

Besides that, this PR updates the KV cache capacity analysis to
make it more accurate in terms of usage calculation, while staying
conservative, since there is a known issue regarding batch-prefill
embedding lookup which may lead to OOM. We will follow up
on the issue with a fix in the future and then enable the estimation
to use more GPU vRAM.

* [Fix] Fix `u_char` for Windows build (#1848)

Prior to this PR, `u_char` was used, but it is not a standard
type in C++, which caused a Windows build failure.

This PR fixes it by using `unsigned char`.

* Auto updated submodule references

* [Fix] Add phi lm head name to is_final_fc, add q4f16_ft to CI (#1849)

[Fix] Add phi lm head name to is_final_fc

* [Build] Replace mod_transform_before_build with IRModule pass (#1852)

Instead of a python function that returns an updated `IRModule`, the
new `optimize_mod_pipeline` function returns a `tvm.ir.transform.Pass`
which can be applied to an `IRModule`.

* [SLM] Add support for InternLM architecture (#1835)

* Create __init__.py

* Add files via upload

* Update model.py

* Update model_preset.py

* Update conv_templates.cc

* Update internlm_loader.py

* Update internlm_quantization.py

* fix name of notes

* Update model.py

* Migration

* fix pylint issue

* fix pylint issue

* fix pylint error

* Update internlm_loader.py

* Update __init__.py

* Update __init__.py

* Delete python/mlc_chat/model/internlm/__init__.py

* Add files via upload

* [Bugfix] Handle model names with multiple path components (#1851)

Prior to this commit, a model name with multiple path
components (e.g. `dist/models/group_name/model_name`) would have
duplicated path components
(e.g. `dist/group_name/artifact_path/group_name/libname.so`).
This commit resolves the duplication.

* [KVCache] Add max num threads awareness to KVCache kernels (#1822)

* [KVCache] Add max num threads to KVCache kernels, fix WebGPU

* Read max_num_threads_per_block when available

* Change merge state in place kernel

* Make attention decode aware of max num threads, not just webgpu

Co-authored-by: Egor Churaev <[email protected]>

* Change util function name

---------

Co-authored-by: Egor Churaev <[email protected]>

* [KVCache] Migrate Baichuan model to PagedKVCache (#1854)

* [Python] Lazy import of transformers for tiktoken conversion (#1860)

This PR moves the import of transformers into the function body
of the tiktoken tokenizer conversion, so we do not have a forced
dependency on transformers.

* [SLM] RWKV5 World Support (#1787)

This PR adds RWKV5 support with RNNState, a similar interface as
PagedAttention.

Co-authored-by: Xiaoyu Zhang <[email protected]>

* [Serving] Register the ChatML conversation template (#1862)

Following #1854, this PR registers the ChatML conversation template.

* [Utils][Transform] Added SetEntryFuncs transform (#1855)

Sets the entry functions for a module.  This utility is intended for
cases where a module contains several externally-exposed functions
and only one is desired for use.  (e.g. Separating out a
`transform_params` function from an `IRModule` that also contains
inference functions.)  This commit only updates the external
visibility, after which `relax.transform.DeadCodeElimination()` can be
applied.

* [Build] Update transform_params_for_each_rank to IRModule pass (#1856)

This allows it to be used as part of an optimization pipeline specified
as a `tvm.ir.transform.Sequential`.

* [Serving][Grammar] Integrate JSON grammar into the generation pipeline (#1867)

This PR is the 3rd part of grammar-guided generation.
It integrates the grammar framework into the generation
process and supports JSON output for now.

The API this PR provides is compatible with the OpenAI api.

### APIs
#### Python API
```
@dataclass
class ResponseFormat:
    type: Literal["text", "json_object"] = "text"
    json_schema: Optional[str] = None

@dataclass
class GenerationConfig:
        response_format: ResponseFormat = ResponseFormat(type="text")
```

#### Rest API
```
response_format: { "type": "text" } # text generation, by default
response_format: { "type": "json_object" } # json generation
response_format: { "type": "json_object", json_schema="..."} # json generation with schema
```

JSON generation with schema is not supported yet,
but is planned for the future.

### Performance
#### Without JSON
```
Single token prefill latency: 891.2234 ms/tok
Single token decode latency: 31.3399 ms/tok
Prefill token throughput: 4693.3077 tok/s
Decode token throughput: 226.4406 tok/s
Overall token throughput: 470.3180 tok/s
```
#### With JSON
```
Single token prefill latency: 219.2287 ms/tok
Single token decode latency: 29.1399 ms/tok
Prefill token throughput: 7392.1555 tok/s
Decode token throughput: 179.2296 tok/s
Overall token throughput: 1052.1996 tok/s
```

We observed a slight decrease in performance under JSON mode.
This will be further optimized in the future.

* [Serving] Support "n" for parallel generation (#1868)

This PR brings field `n` to generation config and thereby
supports parallel generation. This parallel generation effectively
leverages the "fork" functionality of paged KV cache.

This PR supports specifying the number of parallel generations
`n` in the standard OpenAI ChatCompletion API. This is the last
feature towards OpenAI API feature completeness.
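For illustration, a request payload using `n` might look as follows (the model path and prompt are placeholders):

```python
payload = {
    "model": "dist/Llama-2-7b-chat-hf-q4f16_1-MLC",  # placeholder model path
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "n": 3,  # ask for three parallel generations from the same prompt
}
# Each of the n completions comes back as a separate entry in "choices";
# the paged KV cache "fork" lets them share the common prompt prefix.
```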

* [CI] Add retry to scm checkout (#1869)

Sometimes scm checkout can time out; this PR adds retry to it.

* [Attn] Use float32 accumulation in attention kernel (#1870)

Prior to this PR, the TIR attention kernels did not cast matmul
operands to fp32 before multiplying.
For models like Phi-2, which may have large Q/K/V data (at the level
of a few hundred), the fp16 multiplication exceeds the range of
fp16 and sometimes leads to the attention result being NaN.

This PR fixes this issue.
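The overflow is easy to reproduce in NumPy: fp16 tops out at 65504, so Q/K values around a few hundred overflow when multiplied, while casting to fp32 first is safe:

```python
import numpy as np

q = np.float16(300.0)
k = np.float16(300.0)

fp16_product = q * k                          # 90000 > 65504, overflows to inf
fp32_product = np.float32(q) * np.float32(k)  # accumulating in fp32 is safe
```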

* [Utils] Allow ReorderTransformFunc to be used without param manager (#1857)

Prior to this commit, the `ReorderTransformFunc` required several
components of the `ParamManager`.  The functionality it
provides, reordering dataflow blocks to minimize the liveset, is
useful outside the context of the `ParamManager`.  This commit
makes the following changes, allowing it to be used independently of
the `ParamManager`.

- Generate the `pidx2binname` dictionary outside of `ReorderTransformFunc`

- Allow parameters to be separate `func.params`, rather than a single
  bundled tuple parameter.

* [SLM] Migrate Phi-2 to paged KV Cache #1871 (#1872)

This PR migrates Phi-2 to Paged KV cache attention as part of the model definition migration tracked in #1749 .

Co-authored-by: Shrey Gupta <[email protected]>

* [Fix] Fix the use of "call_inplace_packed" and "call_pure_packed" (#1874)

The use of `call_inplace_packed` and `call_pure_packed` in the old
flow is outdated due to signature changes. This PR fixes the issue.

* [Fix] Add the missing BundleModelParams pass (#1875)

PR #1852 missed applying the BundleModelParams pass and thus made
the compiled models not runnable through ChatModule (#1864). This PR
fixes the issue.

* [Docs] Update Android APK download link (#1876)

As pointed out by #1830, this PR fixes the Android app download
link in docs.

* Fix MLC-LLM website link weight convert not accessible (#1877)

Fix website link not accessible

* [Serving][Grammar] Support termination state in GrammarStateMatcher (#1884)

* [Serving] Make RequestState as a standalone object class (#1878)

This PR adopts suggestions from the support of OpenAI API parallel
generation `n` in #1868. The main update in this PR is to make
the RequestState as a standalone object class, which was a typedef
from `std::vector<RequestStateEntry>` before.

This PR also fixes a bug in prefill that will cause engine failure
when `n` is large.

* [SLM] Update StableLM model and migrate it to paged KV Cache (#1882)

* [KVCache] Qwen 1.0 Model PagedKV Support (#1887)

Support Qwen1.0 Paged KV Cache

* [Serving] Estimate KV cache memory usage with metadata (#1888)

Prior to this PR, the serving engine memory usage estimation read
the model config for fields such as `num_key_value_heads`,
`num_hidden_layers`, etc. However, since not every model shares the
same set of config names (#1854), the estimation fails for models
that do not have this set of config field names.

This PR makes the following changes. First, it attaches these
field values to the model's metadata, which effectively unifies the
field names across different models. Then, when estimating
memory usage, we read these fields from the metadata rather than
the model config, so we are safe from the name inconsistency.
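Schematically, the estimation then depends only on unified metadata keys; the key names and fp16 assumption below are illustrative:

```python
def estimate_kv_cache_bytes(metadata, max_total_seq_len, dtype_bytes=2):
    # The leading 2 accounts for both K and V caches; dtype_bytes=2 assumes fp16.
    return (2
            * metadata["num_hidden_layers"]
            * metadata["num_key_value_heads"]
            * metadata["head_dim"]
            * max_total_seq_len
            * dtype_bytes)
```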

* [KVCache] Migrate bigcode arch to PagedKVCache (#1891)

Compilation and runtime are smooth. I will open follow-up PRs to enable starcoder2 support in the same model definition file.

* [Serving] Add Phi-2 conv template to mlc serve (#1890)

This PR adds the phi-2 model template to MLC serve.

For testing:
1. Start the server:
```
python -m mlc_chat.serve.server --model ./dist/phi-2-q4f16_1-MLC/ --model-lib-path ./dist/phi-2-q4f16_1-MLC/phi-2-q4f16_1-cuda.so --device auto --max-batch-size 2 --enable-tracing --host 127.0.0.1 --port 8000 --max-total-seq-length 8000
```
2. Send a request:
```
python test_server_rest_api.py
```

```python
# test_server_rest_api.py
import requests
import json

model = "./dist/phi-2-q4f16_1-MLC/"
port = 8000
payload = {
    "model": f"{model}",
    "messages": [{"role": "user", "content": "Tell me about Machine Learning in 200 words."}],
    "stream": False,
}
r = requests.post(f"http://127.0.0.1:{port}/v1/chat/completions", json=payload)
if r.status_code != 200:
    print(r.json())
else:
    print(r.json()["choices"][0]["message"]["content"])
```

* [Attn] Fix attention kernel for head dim not divisible by 32 (#1889)

Prior to this PR, our TIR prefill attention kernel assumes the
head dim to be a multiple of 32. As reported by #1826, this assumption
does not always hold.

This PR fixes this issue so that models with different head dim can
also compile.

* [Python] Enable "thrust" for CUDA by default (#1866)

This PR enables thrust for CUDA targets so that we can
dispatch some operators (e.g., cumsum) to thrust.

* [Serving] Fix loading presharded weights (#1894)

* [Serving] Address embedding lookup OOM issue (#1899)

This PR addresses the OOM issue that may be caused by embedding
lookup when the batch size of a prefill action is large.
Prior to this PR, a large embedding tensor was created for
each sequence in the prefilled batch, which could take unexpectedly
large memory when the batch size is large.

* [Model] Remove redundant `batch_forward` and move broadcast (#1900)

This PR contains four changes:

1. It removes the duplicate `batch_forward` defined in model
definitions. This function was widely used prior to our migration
to PagedKVCache, since before migration the attention codepath
of single sequence forward and batch forward differ. But since our
migration, the codepaths are unified into one, and therefore we
can safely remove most `batch_forward` functions.

2. It moves `op.ccl_broadcast_from_worker0` from model main forward
(which will be called at the beginning of prefill/decode) to embedding.
This change has two benefits. First, the token ids taken by `embed`
were not broadcast across workers, so it was possible for workers
other than 0 to have illegal token ids outside the range of the
vocab size; moving the broadcasting to `embed` addresses
this issue. Second, broadcasting token ids in `embed` is more
lightweight than broadcasting embeddings in `prefill`/`decode`, since
the tensor size of token ids is much smaller.

3. It adds `max_batch_size` to the config class of models, so that
they are potentially compatible with batching and MLC serve.

4. It removes the `k_cache` and `v_cache` effects from the models
that have switched to PagedKVCache support.

Randomly picked a few models (as below) to run the engine test, and
all of them are passed:

* phi-2 with tp=2,
* RedPajama with tp=2,
* stablelm without TP (since stablelm does not support TP right now).
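To see why broadcasting token ids is cheaper than broadcasting embeddings, compare the per-sequence sizes; the numbers below are just an example configuration:

```python
seq_len, hidden_size = 128, 4096

token_ids_bytes = seq_len * 4                # int32 token ids
embedding_bytes = seq_len * hidden_size * 2  # fp16 embeddings

ratio = embedding_bytes // token_ids_bytes   # far more data to broadcast
```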

* [KVCache]Migrate Qwen2 model to PagedKVCache (#1903)

* [CI] Skip not supported quantization in model compilation test (#1904)

This PR updates the model compilation test so that it now skips
a quantization when the model does not support it.

* [Serving] Add missing header for `std::iota` (#1905)

The header `<numeric>` was missed, which may have caused build
failure on Windows. This PR adds the header.

* [Serving] Fix Model TokenEmbed function with TP (#1906)

This PR fixes a severe bug introduced by #1899.

Since #1899, we no longer copy the embedding back from worker 0
when using tensor parallelism. However, we did not synchronize
with worker 0.

This causes the following issue: in batch prefill, we
continuously call TokenEmbed multiple times. Each time, we
copy the token ids to the `token_ids` NDArray on worker 0.
If we do not synchronize with worker 0, then it is possible that
the local token ids have been updated for multiple times, before
the first `CopyToWorker0` really starts to execute on the worker 0
side. As a result, at the time of executing the token ids copy to
worker 0, the local token ids might be wrong (by "wrong", say we
are executing the copying of seq 0's token ids, then the actual
local token ids array might have already been seq 3's token ids).

As a result, the issue causes batch prefill to behave completely
incorrectly. This PR adds an explicit synchronization with worker 0.

* [SLM] Add support for Orion architecture. (#1883)

This is a PR for supporting [OrionStarAI/Orion-14B-Chat](https://huggingface.co/OrionStarAI/Orion-14B-Chat).

* [Model] Eliminate the reshape in embedding func (#1908)

Prior to this PR, there was a trailing reshape kernel at the end of
the embedding func. The reshape does not need to be
a kernel, and consumes extra time during execution. This PR
eliminates the reshape in the embedding function by updating the
signature of the embedding func, so that it now takes only the plain
1D token ids as input.

* [Pass] Low batch GEMM using GEMV-like schedule (#1769)

When batch size is small, GEMM in MLP of decode stage can be
dispatched into a specialized GEMV-like schedule to improve efficiency.
GEMM with a dynamic var in spatial axis will now be lowered into 
```python
if dyn_var <= 8:
    low_batch_gemv()
else:
    normal_gemm()
```

* Auto updated submodule references

* [Serving] Avoid unnecessary worker sync in Model (#1909)

Following up #1906, this PR removes the synchronization given it is
avoidable. We use another approach to avoid the write-after-write
issue.

The key to addressing the issue is to make sure the addresses to be
copied to worker 0 are not overwritten before the copy actually happens.
So we pre-allocate a large host array to hold all the token ids,
and for each sequence, we copy its token ids to the offset given
when calling TokenEmbed, so that we can make sure an address will
not be written twice before copy happens.
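The idea can be sketched as a staging buffer where each sequence writes at a distinct offset; the class and method names here are illustrative, not the actual implementation:

```python
import numpy as np

class TokenIdStaging:
    def __init__(self, capacity):
        # Pre-allocated host array holding token ids for the whole batch.
        self.buffer = np.empty(capacity, dtype=np.int32)
        self.offset = 0

    def stage(self, token_ids):
        # Each sequence writes at its own offset, so no address is
        # overwritten before the async copy to worker 0 completes.
        start = self.offset
        self.buffer[start:start + len(token_ids)] = token_ids
        self.offset += len(token_ids)
        return start  # offset handed to the device copy

    def reset(self):
        # Reuse the buffer only after all pending copies have finished.
        self.offset = 0
```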

* [Serving][Grammar] Enhance GrammarStateMatcher to support general grammar (#1917)

* [Android] Improve perf of TIR PagedAttn kernel on Android (#1915)

* android perf

* Update kv_cache.py

* Deprecate old flow (#1928)

* Deprecate old flow

This PR deprecates the old flow.
As of today most of the efforts are centralized around the new flow
with SLM compilation. Additionally, we are bringing model definitions
through a unified KV interface so we can have a single model definition
across all backends, server and local settings.

We kept the old flow around for a while, but it is a good
time to do the transition. All the documents are updated
to point to the new flow.

We also created a backup branch
https://github.com/mlc-ai/mlc-llm/tree/backup-before-old-flow-deprecation
for people who would like to checkout some of the old flow references.

* Remove deprecated prebuilts

* [Serving] Register the StableLM3B conversation template (#1920)

Update conversation_template.py

* Remove deprecated build.py

* [Fix] KVCache creation with call_pure_packed (#1930)

With https://github.com/apache/tvm/pull/16684 merged in, the KV
cache creation will fail when compiling models. This PR fixes the
problem by using `call_pure_packed`.

* [KVCache] Update FlashInfer PackedFunc names (#1931)

This PR updates the FlashInfer names given
https://github.com/apache/tvm/pull/16692 has been merged.

* [REFACTOR] remove tests/legacy-python (#1933)

This PR removes the folder tests/legacy-python
as a follow-up cleanup step for the old flow.

Some of the files, like the lib comparison tools, are useful,
and we should recover them later in the mlc_llm.testing.DebugChat flow.

* [REFACTOR] rename mlc_chat => mlc_llm (#1932)

This PR renames the mlc_chat package to the mlc_llm package
now that this is the new official flow. We also update the necessary
locations that might touch the package.

* Auto updated submodule references

* [Docs] Deprecating CUDA 11.7/11.8 support (#1939)

We have deprecated the wheel support for CUDA 11.7/11.8 due to TVM
thrust compatibility with old CUDA versions.

* [Fix] Fix KV cache call in mistral (#1938)

The latest TVM introduces the wellformedness check of the IR.
The mistral model definition breaks the wellformedness due to the
purity. This PR fixes this issue.

* [ChatModule] Remove eos_token_ids (#1940)

This PR removes the eos_token_ids from the ChatModule given it is
actually not used anywhere.

* [SLM] Weight conversion with generator (#1916)

This PR enhances weight conversion so that it passes a generator
to `tvmjs.dump_ndarray_cache`. This effectively reduces the CPU
memory pressure when converting weights, especially when the total
converted weight size is close to or larger than the CPU memory size.

* [Serve] Introducing GPU sampler for CUDA (#1934)

This PR introduces the GPU sampler for CUDA only. The GPU sampler
makes use of the GPU sampling ops introduced in apache/tvm#16575.

We will follow up to benchmark the performance of the GPU sampler
over CPU sampler.

* [Serve] Constrain KV cache capacity on Metal (#1943)

This PR constrains the KV cache capacity for Metal devices to 32768,
in order to avoid large tensors in the KV cache. This is because
right now the Metal runtime has a performance issue when running
kernels where some input buffer is very large, even if only a small
portion of the buffer is accessed in the kernel.

* [CI] Add windows ci (#1942)

This PR adds windows CI.

* Auto updated submodule references

* [Fix] Fix embedding shape check in ChatModule (#1953)

This PR is a fix to address #1952.

* [Fix] Fetching the Git-LFS tokenizer files (#1954)

Prior to this PR, when running commands like
```shell
python3 -m mlc_chat chat HF://mlc-ai/gemma-7b-it-q4f16_2-MLC
```
only the binary weight files are downloaded, among all the Git LFS
files.

For models like Gemma, whose tokenizer is large and also stored as Git LFS
files, the tokenizer files are not automatically downloaded.
For example, the cloned Gemma `tokenizer.json` file has content
```
version https://git-lfs.github.com/spec/v1
oid sha256:05e97791a5e007260de1db7e1692e53150e08cea481e2bf25435553380c147ee
size 17477929
```
and this pointer content is never materialized into the actual tokenizer.
This leads to issue #1913.

This PR fixes the issue by pulling all the Git LFS files that are not
binary files.

* [LogitProcessor] Add max thread awareness to logit processing kernels (#1955)

Make the kernels in `AttachLogitProcessFunc` to be aware of maximum
threads, fixing https://github.com/mlc-ai/mlc-llm/issues/1951.

Most of the code change is due to indentation; the main change is
replacing `1024` with `tx`, where `tx` is
```
tx = 1024  # default
max_num_threads_per_block = get_max_num_threads_per_block(target)
if max_num_threads_per_block < tx:
    tx = max_num_threads_per_block
check_thread_limits(target, bdx=tx, bdy=1, bdz=1, gdz=1)
```

* [Model] Use static hidden size in mixtral scatter_output (#1959)

* Auto updated submodule references

* [CompilerFlag] Detect if FlashInfer is enabled from libinfo (#1941)

This PR supports detecting whether FlashInfer was enabled when
building TVM, so that FlashInfer won't be enabled when TVM was
not built with FlashInfer support.

* [Serving][Grammar] Add grammar termination as a stop condition (#1964)

* Unify schema for conversation template and embed into mlc-chat-config.json (#1965)

* [SLM] Small correction on Stablelm and Qwen2. (#1958)

* small fix

* small fix

* Update stablelm_model.py

* [Serving][Fix] Fix JSON output check in test_server.py (#1966)

`test_server::is_json_or_json_prefix` is used to check whether the output
is JSON or a prefix of JSON.

It uses json.loads internally. However, json.loads (i.e. json.decode)
is token-based instead of char-based. If half a token is left at the
end of the string, it cannot be matched.

This PR adds another check for the remaining "half token" if it exists.

* [Model] Migrate Mistral to use PagedKVCache (#1967)

This PR migrates the mistral model to the PagedKVCache interface
which supports sliding window attention with paged attention kernel
written in TensorIR.

We thereby introduce a `support_sliding_window` mode for KV cache,
which leaves space for supporting sliding window for any model at
runtime.

This PR tests Mistral with both chat and serve.
The chat performance of Mistral 7B improves over before,
benefiting from the paged attention implementation.

* Auto updated submodule references

* [REST] Update Rest API docs for the latest serve flow (#1972)

* [Docs][Upd] Server launch, examples for endpoints for MLC Serve

* remove v1/completions

* add api docs to rest

---------

Co-authored-by: Shrey Gupta <[email protected]>

* [Conv] Add bos_token to llama and mistral in ConvTemplateRegistry (#1970)

Since we don't have the `add_bos` field in the new Conversation
template, we should add the bos token into the
system_prefix_token_ids, so that it will be added to the
tokenized prompt.

* [Model][Serve] Add support for LLaVa model in serving engine (#1974)

This PR adds support for LLaVa-v1.5 model on the serving engine.
Use the HF weights and config from https://huggingface.co/llava-hf/llava-1.5-7b-hf.

Passing image input is supported as url (reference: https://platform.openai.com/docs/guides/vision)
Example:

```python
data = {
    "model": "dist/llava-1.5-7b-hf-q4f16_1-MLC/params/",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": "https://llava-vl.github.io/static/images/view.jpg",
                },
                {"type": "text", "text": "What does this image represent?"},
            ],
        }
    ]
}
response = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=data)
print("Response body:", response.text)
```

* [Serve] Hot fix for the mixtral serving (#1975)

[Fix] hotfix for the mixtral serving

Co-authored-by: Yong Wu <[email protected]>

* [REST] REST API Deprecated (#1973)

Deleted old Rest API

- Removed rest.py
- Removed old interface/openai_api.py
- Update ChatModule to use new OpenAI Api protocol

Co-authored-by: Kartik Khandelwal <[email protected]>

* [Fix] Fix handling of non-numerical cuda arch (#1976)

On the latest GPUs, the CUDA arch may not be an integer, e.g. `sm_90a`.
This fixes a few places that rely on integer parsing.

* [Serving][Grammar] Support specifying the main rule in grammar (#1982)

finish

* [Fix] Fix `MLC_MULTI_ARCH` with arch `sm_90a` (#1984)

This PR fixes the missing patch for targets with `sm_90a` arch, as a follow-up PR of #1976.

* Fix Llama-2 and Mistral conversation template. Update ConvTemplateRegistry (#1981)

The current prompt format for Llama-2 and Mistral is not
completely correct.

This PR updates the code to strictly follow the official prompt
format for the two models. Also adds in missing conv templates
to ConvTemplateRegistry.

* [SpecDecode] Fix sampler selection. (#1971)

This PR temporarily fixes sampler selection logic for speculative
decoding. As GPU sampler support for speculative decoding is
not ready, speculative decoding will use cpu sampler.

* [Serving][Grammar] Utility to convert json schema to EBNF grammar (#1983)

This PR adds a generic utility to convert a JSON schema, especially one generated from pydantic, to an EBNF grammar. This helps grammar-guided generation when we provide a JSON schema as the restriction.

This converter features the support of json standard indent style in the output grammar.

API:
```
def json_schema_to_ebnf(
    json_schema: str,
    *,
    indent: Optional[int] = None,
    separators: Optional[Tuple[str, str]] = None,
    strict_mode: bool = True,
) -> str:
    """Convert JSON schema string to EBNF grammar string.

    Parameters
    ----------
    json_schema : str
        The JSON schema string.

    indent : Optional[int]
        The number of spaces for each indent. If it is None, there will be no indent or newline.
        The indent and separators parameters follow the same convention as
        `json.dumps()`.

    separators : Optional[Tuple[str, str]]
        The separator between different elements in json. Examples include "," and ", ".

    strict_mode : bool
        Whether to use strict mode. In strict mode, the generated grammar will not allow
        unevaluatedProperties and unevaluatedItems, i.e. these will be set to false by default.
        This helps LLM to generate accurate output in the grammar-guided generation with JSON
        schema.
    """
    pass
```

* Auto updated submodule references

* [Fix] Fix serve model to adapt the latest Allocator signature (#1989)

PR apache/tvm#16738 updated the Allocator signature. This PR
updates the caller side accordingly.

* [Model] Use optimized group gemm for Mixtral (#1988)

* [Attn] Fix the construction of attn result merge kernel (#1995)

This PR fixes the mistake of passing wrong number of heads
to the attention result merge kernel.

* [iOS][Android] Add validation of library file for iOS and Android build (#1993)

This PR adds validation of symbols in iOS and android build.
During static library build, we need the right model_lib
for us to point to the packaged model executables.

Not doing so correctly will results in vm_load_executable not found
which is not informative.

This PR we validate the compiled model lib by dumping the global symbols
and ensure the list of model libs matches with each other.

In future we should perhaps lift the validation to mlc_llm package.

* Auto updated submodule references

* [Serve] add allocator in Storage as the upstream change (#1997)

The changes in https://github.com/apache/tvm/pull/16750
modified the signature of Storage; this pull request updates
the caller code in mlc-llm to accommodate the new Storage
class signature. Ran into a build error without the change.

* [Compiler] Support IPC memory and customized all-reduce kernels (#1990)

This PR introduces the IPC memory and customized all-reduce kernel
dispatches for tensor parallelism. We add a new compiler flag
`--allreduce-strategy`, which supports `"ring"`, `"one-shot"` and
`"two-shot"`. The flag defaults to `"ring"`, which means this PR
makes no difference if people do not manually change the all-reduce
strategy.

As of now the IPC-memory-backed customized all-reduce kernels are
only available on CUDA.

To enable all-reduce strategies other than "ring", here are some
example compile commands:
```python
python -m mlc_llm compile model/mlc-chat-config.json --device cuda --opt "allreduce-strategy=one-shot" -o model/lib.so
python -m mlc_llm compile model/mlc-chat-config.json --device cuda --opt "allreduce-strategy=two-shot" -o model/lib.so
```

Please be aware that you probably also need to specify other
compiler flags, for example `--opt "cublas_gemm=1;allreduce-strategy=one-shot"`.

* Auto updated submodule references

* [Model] Fix the top-k TIR script for well-formedness (#2002)

This PR fixes the malformed MoE TIR scripts.

* Fix invalid use of dataflow var in sampler output (#2003)

* [Fix] Fix KV cache creation pass after nn.Module changes (#2011)

This PR corrects the assertion after latest changes in apache/tvm
that updates some nn.Module behavior.

* [iOS] Fix typo in prepare_model_lib.py (#2013)

Fix typo in prepare_model_lib.py

tar_list.append(valid_paths[ls0]) is introduced by mistake in https://github.com/mlc-ai/mlc-llm/pull/1993

* Remove unstable assertion in KV cache creation dispatch (#2017)

This particular assertion is unstable recently given the back-and-forth upstream TVM nn.Module exporter behavior.

* Auto updated submodule references

* [SLM] Qwen2 Multi-GPU support (#1985)

* Update qwen2_model.py

* fix lint issue

* fix lint issue

* fix lint issue

* more info for preshard  (#2027)

* Previously, when the pre-sharded version of a model was not available, the program silently fell back to the normal workflow without issuing any alert. Now, when someone attempts to convert to a pre-sharded model but cannot, the program throws a warning message to inform users that it will revert to the standard model conversion process.

* format fix.

* black reformatted, i did not see any diff.

* black reformatted..

* Register stablelm-2 conversation template (#2029)

* [Serving][Fix] Fix problems in PopenServer (#2032)

This PR fixes several problems in the PopenServer:

- Add a check for the case where the server has not started and the request
returns a failure status code, e.g. 502. Also changed the retry interval to 0.1s.

- Add `__enter__` and `__exit__` methods to PopenServer.
When the program is interrupted, using a with clause (`__enter__`
and `__exit__`) ensures the server always terminates. With plain
`start()` and `terminate()`, the server may keep running in the
background even after the parent process ends.
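The lifecycle guarantee can be sketched with a minimal stand-in class (not the actual `PopenServer` implementation; the command below merely simulates a long-running server process):

```python
import subprocess
import sys

class PopenServerSketch:
    """Minimal stand-in showing the __enter__/__exit__ pattern added in this PR."""

    def __init__(self, cmd):
        self.cmd = cmd
        self.proc = None

    def start(self):
        self.proc = subprocess.Popen(self.cmd)

    def terminate(self):
        if self.proc is not None:
            self.proc.terminate()
            self.proc.wait()
            self.proc = None

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs even if the body raised, so the server never lingers.
        self.terminate()

with PopenServerSketch([sys.executable, "-c", "import time; time.sleep(30)"]) as server:
    assert server.proc.poll() is None  # subprocess is alive inside the block
```

On exit from the `with` block, `terminate()` runs unconditionally, which is exactly what plain `start()`/`terminate()` cannot guarantee under interruption.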

* [Quantization] Skip MoE gate layer (#2012)

This PR skips quantizing the MoE gate layer.

* [Serving][Grammar] Integration of JSON schema generation (#2030)

Previous PR #1983 introduced a transformation from json schema
to BNF grammar.

This PR further integrates the grammar from JSON schema into the
generation pipeline, so that the engine now supports JSON schema
output. GrammarStateInitContexts are stored in a cache, so they will
not be created again for the same schema.

Interface:

- Python
```
@dataclass
class ResponseFormat:
    type: Literal["text", "json_object"] = "text"
    schema: Optional[str] = None
```

- Rest API
```
class RequestResponseFormat(BaseModel):
    type: Literal["text", "json_object"] = "text"
    json_schema: Optional[str] = Field(default=None, alias="schema")

class CompletionRequest(BaseModel):
    ...
    response_format: RequestResponseFormat = Field(default_factory=RequestResponseFormat)

class ChatCompletionRequest(BaseModel):
    ...
    response_format: RequestResponseFormat = Field(default_factory=RequestResponseFormat)
```
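For illustration, a request body carrying a schema could be constructed like this (the model id and the schema itself are placeholders, not part of this PR):

```python
import json

# Hypothetical JSON schema constraining the model output (placeholder).
person_schema = json.dumps({
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
})

# Note the wire field is "schema" (the Pydantic alias), not "json_schema".
request_body = {
    "model": "Llama-2-7b-chat-hf-q4f16_1",  # placeholder model id
    "messages": [{"role": "user", "content": "Describe a person in JSON."}],
    "response_format": {"type": "json_object", "schema": person_schema},
}
payload = json.dumps(request_body)
assert json.loads(payload)["response_format"]["type"] == "json_object"
```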

Performance:

We only test single-batch performance for now, to show the latency overhead.

- Model: `Llama-2-7b-chat-hf-q4f16_1`
- GPU: `NVIDIA GeForce RTX 3080`
- CPU: `AMD Ryzen 9 5900X 12-Core Processor`

```
JSON ON Batch=1
Average prefill tokens: 651.0000 tok/req
Average decode tokens: 499.0000 tok/req
Single token prefill latency: 0.3140 ms/tok
Single token decode latency: 8.6831 ms/tok
Prefill token throughput: 3184.8002 tok/s
Decode token throughput: 116.6039 tok/s

JSON OFF Batch=1
Average prefill tokens: 651.0000 tok/req
Average decode tokens: 499.0000 tok/req
Single token prefill latency: 0.3098 ms/tok
Single token decode latency: 8.6823 ms/tok
Prefill token throughput: 3227.8141 tok/s
Decode token throughput: 116.9251 tok/s
```

This PR also does these bug fixes / changes:
- Changed the structure of the grammar converted from the schema to
avoid a large number of uncertain tokens, which caused a
performance degradation

* [Compiler] Support AUTO mode for all-reduce strategy (#2034)

This PR supports the auto mode for IPC all-reduce strategy.
It renames the strategy from `allreduce-strategy` to
`ipc-allreduce-strategy` in the compiler optimization flags. The
default RING mode is renamed to NONE mode, which, when specified,
uses nccl all-reduce without any IPC memory rewrite.

So right now to enable IPC all-reduce, the ideal way is to do
`ipc-allreduce-strategy=auto`.

* [LLaVa] Follow-up for TODOs in LLaVa model (#2010)

LLaVa follow-ups:
1. Added base64 image support.
2. Merged `as_prompt` and `as_prompt_list`.
3. `get_image_from_url` now uses the config.

* [Pipeline] Defer GPU IPC memory lowering (#2038)

This PR moves the position of GPU IPC memory lowering pass in pipeline,
so that it applies after the CUDA graph rewrite to enable CUDA graph
with the customized all-reduce kernels.

* [Model] Add missing broadcast of logit_position for multigpu (#2040)

This commit adds the broadcasting of `logit_pos` in batch prefill
for all models to avoid the logit position out-of-bound issue.

* [Preshard] apply presharding after quantization (#2039)

This changes the behavior of presharding to apply it after
quantization, which makes the behavior consistent with and
without presharding.

* [SLM] Baichuan Multi-GPU support (#2037)

This PR enables TP function of Baichuan2 model.

* Auto updated submodule references

* [Model] Skip TVMSynchronize when tracing is not enabled (#2041)

This PR removes the synchronization in `Model` when Chrome tracing
is not enabled. It helps some logit processing kernels launch
earlier.

* [Serving] Support NVTX for benchmarking (#2043)

This PR adds NVTX support to MLC serve, which helps analyze
benchmarking results.

**Note.** To enable NVTX, please add `set(USE_NVTX ON)` to file
`build/config.cmake`.

* Update huggingface_loader.py

* [Serve] Separate callback invocation to another thread in AsyncEngine (#2046)

This PR enhances the AsyncThreadedEngine by separating the callback
invocation into another thread, in order to reduce the CPU time
overhead of invoking Python callbacks.
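The general pattern can be sketched with a plain queue and a worker thread (a simplified stand-in, not the engine's actual code):

```python
import queue
import threading

def run_with_callback_thread(results, callback):
    """Deliver outputs to a user callback on a dedicated thread, so the
    producing loop is not blocked by potentially slow Python callbacks."""
    q = queue.Queue()

    def worker():
        while True:
            item = q.get()
            if item is None:  # sentinel: producer finished
                break
            callback(item)

    t = threading.Thread(target=worker)
    t.start()
    for r in results:  # stands in for the engine's main loop
        q.put(r)
    q.put(None)
    t.join()

received = []
run_with_callback_thread([1, 2, 3], received.append)
assert received == [1, 2, 3]
```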

* [LLaVa] Fix random token output after first sentence (#2048)

Fix Llava random token after first '.' token

Co-authored-by: Animesh Bohara <[email protected]>

* Auto updated submodule references

* [Pass] Fix LiftGlobalBufferAlloc for proper GlobalVar struct info (#2053)

This PR fixes the GlobalVar struct info mismatch issue caused by
the LiftGlobalBufferAlloc pass after a latest TVM commit.

* Auto updated submodule references

* [Serving] CLI Support for SERVE (#2014)

This PR adds CLI support for serve.

Usage:

`mlc_llm serve [Model]`

refer `mlc_llm serve -h` for more options

Comments
- Supports JIT compilation of Model lib
- Added context manager to `ServerContext` class

Co-authored-by: Ruihang Lai <[email protected]>
Co-authored-by: Shrey Gupta <[email protected]>

* [Pipeline] Insert hints to enable cuda graph symbolic capture (#2050)

* [Pipeline] Add pass to insert hints to enable cuda graph symbolic capture

* [Loader] Print message when multi-GPU loader is finished (#2051)

* [Loader] Print message when multi-GPU loader is finished

* Update multi_gpu_loader.cc

* fix

* [KVCache] Support matching arbitrary element offset for aux data (#2057)

This PR enhances the TIR attention-related functions to support
matching arbitrary element offsets. This makes room for the KV cache
to allocate one large array for all the auxiliary data and do slicing
on it.

This PR should affect nothing for the current codebase, given all
the element offsets are zeros as of now.
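The idea of one backing allocation addressed by per-array element offsets can be illustrated with a toy sketch (not the KV cache code itself):

```python
import array

# Two auxiliary arrays packed into one backing allocation.
page_table = [1, 2, 3]
positions = [10, 20, 30, 40]

backing = array.array("i", page_table + positions)  # one large array
offsets = {"page_table": 0, "positions": len(page_table)}  # element offsets

def aux_view(name, length):
    """Slice the named auxiliary array out of the shared backing store."""
    start = offsets[name]
    return list(backing[start:start + length])

assert aux_view("page_table", 3) == [1, 2, 3]
assert aux_view("positions", 4) == [10, 20, 30, 40]
```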

* [Serving] Support copy stream in LogitProcessor and GPUSampler (#2058)

This PR introduces copy stream to LogitProcessor and GPUSampler
for CUDA, so that auxiliary data can be copied on a separate stream
and overlap with the computation time.

* [SLM] Stablelm Multi-GPU support (#2052)

This PR enables TP function of Stablelm model.

* [KVCache] Introducing single page copy func for KV cache fork (#2060)

This PR introduces the single page copy TIR function for KV cache.
This function is helpful for sequence fork at specified positions.

NOTE: this PR is a breaking change, so you will need to re-compile
your model and update TVM or the MLC-AI pip package to the latest.

Related PR: apache/tvm#16813

Co-authored-by: Yaxing Cai <[email protected]>

* [Python] Implement testing.DebugChat for end-to-end model debugging (#2056)

* [Docs] Fix docs for python server and rest call (#2066)

This PR updates the MLC serve documentation for server launching.

* [CI] Enable submodule clone for WASM model compilation (#2068)

The incoming WASM runtime requires 3rdparty for builds. This PR enables
the submodule clone for WASM model compilation in CI.

* [Serve] Fork sequence at specified positions (#2067)

With PagedKVCache supporting fork at a specified position, this PR
updates `Model` interface accordingly. The fork position defaults
to -1, which means the last position.
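The fork semantics can be illustrated with toy lists standing in for KV cache sequences (a hypothetical helper, not the actual `Model` interface):

```python
def fork_sequence(parent_tokens, fork_pos=-1):
    """Create a child sequence sharing the parent's prefix up to fork_pos;
    -1 (the default) means forking at the last position, i.e. the full parent."""
    if fork_pos == -1:
        fork_pos = len(parent_tokens)
    return list(parent_tokens[:fork_pos])

parent = [1, 2, 3, 4]
assert fork_sequence(parent) == [1, 2, 3, 4]  # default: fork at the end
assert fork_sequence(parent, 2) == [1, 2]     # fork mid-sequence
```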

* [SLM] Add support for RWKV6 model  (#1977)

* [SLM]: Support for rwkv tokenizer

* [SLM] RWKV6 World Support

* [Quantization] Reorganize utils code in group_quantization (#2055)

* [Serving] Bugfix for empty stop string  (#2070)

add check for empty stop string; fix Vanilla LM conversation template

* [SLM] Internlm Multi-GPU support (#2072)

This PR enables tensor parallelism support for InternLM model.

* [WebGPU] Add mlc wasm runtime, support grammar in web (#2061)

* [WebGPU] Add mlc wasm runtime, support grammar in web

* Make in web for wasm ci

* Fix wasm ci

* Fix wasm ci

* Change export library arg name

* Move macro to cc instead of makefile

* [Build] Use TVM_HOME environment variable (#2073)

Prior to this commit, the `CMakeLists.txt` file checked a cmake
`TVM_HOME` variable, but did not check the usual `TVM_HOME`
environment variable.  If this variable is set, it should be used.

* [Serving] Support input chunking (#2069)

This PR supports input chunking with regard to customized
"prefill chunk size" (field `prefill_chunk_size` in
`mlc-chat-config.json`). With this PR, we can now chunk a long input
into multiples when there is an upper limit on the prefill chunk size.
Only `TokenData` is supported for now.
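The chunking itself amounts to splitting the token sequence by the configured chunk size; a minimal sketch (not the engine's actual code):

```python
def chunk_input(token_ids, prefill_chunk_size):
    """Split a long token sequence into chunks bounded by the prefill chunk size."""
    return [
        token_ids[i:i + prefill_chunk_size]
        for i in range(0, len(token_ids), prefill_chunk_size)
    ]

chunks = chunk_input(list(range(10)), prefill_chunk_size=4)
assert chunks == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```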

* [Docs] API Code Completion Guide (#2054)

* Allow "mlc_llm --host" option to override host triple the model compi… (#2074)

Allow "mlc_llm --host" option to override host triple the model compile to

* [Web] Move prep emcc deps script to web folder (#2077)

* [SLM] Qwen Multi-GPU support (#2075)

* Fix mismatch of metadata func and global symbol (#2078)

* Fix mismatch of metadata func and global symbol

* Update estimate_memory_usage.py

* [Disco] Set worker CPU affinity with env variable (#2042)

This PR enables setting the CPU affinity of disco workers in
MLC, following the support in apache/tvm#16807. The purpose is
to try to reduce the CPU core-switching overhead imposed on disco
workers, which may cause extra bubble time in the workers
before/during tasks.

We use a macro `MLC_DISCO_WORKER_CPU_BINDING` to specify the CPU
affinities of workers. This is by default not used. To enable it,
you can run the command like

```shell
MLC_DISCO_WORKER_CPU_BINDING=64,65,66,67 python some_mlc_app.py
```

to specify the four CPU core ids for the four workers.

* [Quantization] Introduce PerTensor and F8 quantization (#2079)

* [Quantization] Introduce PerTensor and F8 quantization

* address comments

* [Serving][Refactor] Rename AsyncThreadedEngine to ThreadedEngine (#2081)

This PR renames the AsyncThreadedEngine to ThreadedEngine to
prepare for follow up refactors of Python interface. Meanwhile,
this PR exposes a creation function for AsyncThreadedEngine so that
it can be further used by others, such as JSONFFIEngine.

* [Serving] Add cuda profiling in benchmark test (#2084)

* [Serving] Add cuda profiling in benchmark test

* [Grammar] Fix broken grammar tests (#2083)

This PR fixes some grammar parser tests that were broken.

* [Serving][Fix] Fix chunked prefill condition (#2082)

This PR fixes a bug when trying to chunk an input and do prefill.
The stats prior to this PR were wrong.

* [Conversation] Fix RedPajama conversation template (#2087)

As reported and discussed in #2086, this PR fixes the RedPajama
template.

* [Serving][Refactor] Python interface refactor (#2085)

This PR is an initial major Python interface refactor of MLC Serve.

With this PR, `mlc_llm.serve` in Python now exposes two engine classes:
`AsyncEngine` and `Engine`. Both classes have two entrypoints,
`chat_completion` and `completion` which conform to OpenAI Python API
(reference: https://github.com/openai/openai-python).

As the name suggests, `AsyncEngine` works asynchronously, and `Engine`
works synchronously. It is worth noting that the `Engine` introduced in
this PR is different from the previous `Engine`. The new `Engine` does
not provide interfaces for batch generation.

For robustness and correctness, the old `Engine` in Python is moved
to `mlc_llm.serve.sync_engine.SyncEngine`. We do not directly expose
this SyncEngine, and it now mainly serves testing and debug purposes.
It is useful to check the correctness of new features, because of its
simplicity. It keeps the low-level interface to directly invoke `step()`
function of the engine, and also keeps the low-level batch generation
interface.

Our REST API entry points defined under `mlc_llm/serve/entrypoints/`
are also refactored accordingly to adapt to the latest Python API
in MLC Serve. In short, most of the logic in OpenAI API entry points
are moved to Python API, which simplifies the implementation of
entry points.

Please note that this is the first (also the largest) planned refactor.
We will follow up with some other refactors, which have smaller scopes
compared with this PR. The planned refactors include:

* provide submodule interface to align OpenAI Python package in
https://github.com/openai/openai-python
* refactor the constructor interface of `Engine`/`AsyncEngine` to
align the MLC serve CLI interface.

* [Serving] Separating ThreadedEngine creation and initialization (#2090)

This PR separates the creation and initialization of ThreadedEngine
for multi-threading use cases. So we can make sure that the
ThreadedEngine instance is created before any other operations
(such as initialization, running background loop, etc.).

* [Serving] Enhance robustness with small KV capacity (#2091)

This PR enhances the robustness of the engine, which previously had
issues when the KV capacity is small.

* [REST] Update REST API docs (#2092)

This updates the rest docs to use `mlc_llm serve` and also adds a quick start section.

* [DOCS] Clarify vulkan loader dependency (#2095)

This PR clarifies the Vulkan loader dependency.
Some systems may not have the right Vulkan loader,
and we need to install it via conda.

* [SLM] Add support for Chatglm3 architecture (#2096)

This PR enables the ChatGLM3 model.

* [Quantization] Add OpenCL device (#2097)

This PR adds OpenCL device for weight conversion.

* [Serving] Support stream=True for Python API (#2098)

The previous refactoring PR formalizes the MLC serve Python API
but does not respect the `stream` flag properly: no matter if
`stream` is True or False, the functions always work in a streaming
style. This PR supports the non-stream case.

* [Serving][Refactor] OpenAI API Python interface alignment (#2099)

This PR aligns the Python API of chat completions and completions MLC
serve with the OpenAI Python package https://github.com/openai/openai-python.

Specifically, say we first create an engine or async engine, then
we can use entrance `engine.chat.completions.create(...)` for chat
completions.

We will add more use examples in the codebase after another few
refactors.

* [DOC] fix small python env install error (#2102)

Fixed one slight issue in the TVM install instructions: on some
platforms one needs to specify python=3.11, otherwise a "python not
found" error may occur.

* [JSONFFIEngine] Initial implementation of JSONFFIEngine (#2101)

This PR introduces initial support for the JSONFFIEngine.
The request is supposed to be a JSON string in the
[Chat completion request body format](https://platform.openai.com/docs/api-reference/chat/create).
The output (input to the callback function provided) is a list of
JSON strings in the [Chat completion chunk object format](https://platform.openai.com/docs/api-reference/chat/streaming).

There is still functionality to be added, which will be added in follow-up PRs. 
1. Support for other input datatypes (image, etc.)
2. Applying conversation template to input
3. Function calling and tools support
4. Generation config parameters support
5. Independent text streamers for each request
6. logprobs support

---

Co-authored-by: Ruihang Lai <[email protected]>

* [Model] Use tanh approximation of GeLU in Gemma MLP (#2106)

This is in line with the implementation in the [transformers](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gemma/modeling_gemma.py#L183) library.
Also, the [gemma-1.1](https://huggingface.co/google/gemma-1.1-2b-it/blob/main/config.json#L10) model config.

* Auto updated submodule references

* [Quantization] Stricter checks for MoE gate (#2109)

This PR strengthens the MoE gate checks to include checking the number
of experts, given that the real MoE gate router layer's output feature
number is the number of experts and is usually very small.

This PR comes from a regression: a layer in RWKV6 whose name ends
with "gate" is not for MoE at all.

* Auto updated submodule references

* [LLaVa] Fix allowed text model value in config (#2062)

* Llava support vicuna and mistral text models

* Support f32 quantization

* Lint fix

* Use preset if transformers not installed

* Rebase on main

---------

Co-authored-by: Animesh Bohara <[email protected]>

* Auto updated submodule references

* Revert "Allow "mlc_llm --host" option to override host triple the model compi…" (#2115)

This reverts commit 12ca8fdbe2a24f43bbc72241a76735dbad8c2026.

Co-authored-by: Mengshiun Yu <[email protected]>

* Revert "Auto updated submodule references" (#2117)

This reverts commit c4169d8c8a4afedd06bc9d9b99c3aa65eee4a89e
which causes CI broken.

* [Metadata] Include picojson rather than forward declaring (#2118)

This PR fixes the picojson usage in MLC that conflicts with the latest
changes on the picojson side.

* Auto updated submodule references

* Auto updated submodule references

* [Serving][Grammar] Porting the json schema converter from python to C++ (#2112)

[Serve][Grammar] Porting the json schema converter from python to C++

This PR ports the json schema converter from python to C++. It defines
the interface:
```
std::string JSONSchemaToEBNF(
    std::string schema, std::optional<int> indent = std::nullopt,
    std::optional<std::pair<std::string, std::string>> separators = std::nullopt,
    bool strict_mode = true);
```

And uses it in BNFGrammar::FromSchema.

This helps cases where python cannot be deployed.

* [Model] Use R.topk/cumsum for mixtral (#2107)

* Enable flashinfer when group_size == 6 (#2124)

* [SpecDecode] Support Eagle in speculative decoding (#2080)

1. Add Eagle-Llama-7b-chat model support.
2. Add speculative decoding support with Eagle.

* [Pass] Attach non-negative TIR var attributes (#2125)

This PR attaches the attributes of `tir.non_negative_var` for memory
planning.

* [Serving][Refactor] Engine constructor interface refactor (#2126)

This PR is a refactor of the engine's constructor interface
and the serve CLI interface.

This PR introduces the "mode" argument for engine, which has options
"local", "interactive" and "server". The choice of mode will affect
the automatically inferred value of `max_batch_size`,
`max_total_sequence_length` and `prefill_chunk_size` (only effective
when arguments are not specified. Once an argument is specified,
we will not override it). For detailed specification of the mode,
please check out the CLI help messages in `mlc_llm/help.py` or the
engine constructor in `mlc_llm/serve/engine.py`.

No matter which mode is chosen, we will print out the current mode
and the values of these arguments, for people to understand the
settings of the engine. We also provide hints on how to adjust the
mode. For example,

```
[2024-04-12 16:12:26] INFO chat_module.py:379: Using model folder: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16-MLC
[2024-04-12 16:12:26] INFO chat_module.py:380: Using mlc chat config: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16-MLC/mlc-chat-config.json
[2024-04-12 16:12:26] INFO chat_module.py:529: Using library model: dist/Llama-2-7b-chat-hf-q0f16-MLC/Llama-2-7b-chat-hf-q0f16-MLC-cuda.so
[2024-04-12 16:12:26] INFO chat_module.py:379: Using model folder: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q4f16_1-MLC
[2024-04-12 16:12:26] INFO chat_module.py:380: Using mlc chat config: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json
[2024-04-12 16:12:26] INFO chat_module.py:529: Using library model: dist/Llama-2-7b-chat-hf-q4f16_1-MLC/Llama-2-7b-chat-hf-q4f16_1-MLC-cuda.so
[2024-04-12 16:12:29] INFO engine_base.py:382: Engine mode is "local". Max batch size is set to 4. Max KV cache token capacity is set to 4096. Prefill chunk size is set to 4096.
[2024-04-12 16:12:29] INFO engine_base.py:387: Estimated total single GPU memory usage: 21543.74 MB (Parameters: 16467.64 MB. KVCache: 4450.07 MB. Temporary buffer: 626.03 MB). The actual usage might be slightly larger than the estimated number.
[2024-04-12 16:12:29] INFO engine_base.py:398: Please switch to mode "server" if you want to use more GPU memory and support more concurrent requests.
```

After the refactor, we bring the speculative decoding to the serve
CLI so that people can use multiple models and run speculative
decoding with the server launched in CLI (which was not doable before).

* [Serving] Revamp engine mode selection logging info (#2128)

This PR revamps the logging info for engine mode selection to provide
more detailed information and the rationale of different modes.

* [SLM] Chatglm3 Multi-GPU support (#2123)

This PR enables TP for Chatglm3 model.

* [Serving] Fix support of large `n` under low max batch size (#2136)

Prior to this PR, due to the improper prefill policy on `n` (parallel
generation), the engine would loop forever when a request has `n`
larger than the maximum batch size that the engine can support.

This PR fixes this issue by updating the prefill action, and with this
PR, even the "interactive" engine mode can well support multiple
parallel generation.

After this fix, it is possible that a request requires 10 parallel
generations while the max batch size is 1. Given that the shapes of
temporary NDArrays in the GPU sampler are determined by the max batch
size, the GPU sampler does not natively support sampling 10 tokens at a
time. To approach
this issue, this PR introduces chunking to GPU sampler. Therefore,
in this particular case, the GPU sampler will have chunk size 1,
and the 10 required samples will be processed by the GPU sampler
one by one in order. Chunking is the minimum change we can do to support
large `n`.
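The chunking logic can be sketched as follows (a toy stand-in for the GPU sampler, with `sample_batch` as a hypothetical sampling callback):

```python
def sample_in_chunks(num_samples, max_batch_size, sample_batch):
    """Serve `num_samples` requests with a sampler whose buffers are sized
    for at most `max_batch_size` samples, by chunking the work."""
    out = []
    for start in range(0, num_samples, max_batch_size):
        size = min(max_batch_size, num_samples - start)
        out.extend(sample_batch(size))
    return out

calls = []
def fake_sample_batch(size):
    calls.append(size)
    return [0] * size  # dummy sampled tokens

# A request with n=10 under max batch size 1 yields ten size-1 sampler calls.
tokens = sample_in_chunks(10, 1, fake_sample_batch)
assert calls == [1] * 10
assert len(tokens) == 10
```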

* [Docs] Revamp landing page with Engine Python API and server (#2137)

This PR revamps the landing documentation page.

* The Python API panel is changed from showing ChatModule to showing
Engine.
* A new panel "REST Server" is added to show a quick start example
of launching REST server and send request.
* A "what to do next" section is introduced at the bottom of the
landing page.

Todo items for future PR:

* add the page of Python API with Engine.
* revamp weight conversion page.
* revamp model library compilation page.

* [Target] Update Target tags (#2141)

The commit updates the target tags, in order to identify the different
SoC hardware targets for further target-specific optimizations.

Meanwhile, update the vulkan support for int64.

* [Util] Support debug debug_compare (#2142)

* [Minor][SpecInfer] Fix Optional FC Bias for Mixtral Eagle Model (#2146)

* Add optional fc bias for mixtral.

* Fix lint.

* [Serving] fix hardcoded host and port in popen_server (#2147)

* [Docs] Introductory tutorial (#2145)

This PR updates the documentation with an introductory tutorial.
The landing page now directs to the quick start page and the tutorial.

* [Serving] Support `DebugCallFuncOnAllAllWorker` and CUDA profiler (#2148)

This PR adds a new function `DebugCallFuncOnAllAllWorker` which calls
a global function of signature `[] -> None` on all distributed workers
when tensor parallelism is enabled (or the local session itself if not
enabled).

As the name suggests, this function is only for the debug purpose, and
we will not expose any public interface to invoke this function.

This PR also introduces the global functions
`"mlc.debug_cuda_profiler_start"` and `"mlc.debug_cuda_profiler_stop"`,
which enables CUDA profiling when using PopenServer.

* [DOCS] Update introduction (#2151)

* [DOCS] Update introduction

Some minor tweaks on the introduction doc

* Update docs/get_started/introduction.rst

Co-authored-by: Ruihang Lai <[email protected]>

---------

Co-authored-by: Ruihang Lai <[email protected]>

* [Serving][Python] Rename Engine to LLMEngine (#2152)

We rename the public Python serve interface from `Engine` to
`LLMEngine` (and from `AsyncEngine` to `AsyncLLMEngine` accordingly)
for better class name clarity.

This is because in cases people do wildcard import, in which case
the name `Engine` itself does not convey enough meaning.

* Auto updated submodule references

* [Quantization] Add e4m3 mode and enable fp8 storage type (#2154)

* [Quantization] Add e4m3 mode and enable fp8 storage type

* add quantize linear flag

* Revert "[Quantization] Add e4m3 mode and enable fp8 storage type" (#2158)

Revert "[Quantization] Add e4m3 mode and enable fp8 storage type (#2154)"

This reverts commit e9a4a0bf719a7c4fd42b438cf9e159a1e8d72590.

* [Serving] EngineConfig refactor (#2159)

This PR refactors EngineConfig for a cleaner interface of internal
Engine constructor in MLC serve. This is a preparation step towards
the engine reload/unload which will be introduced in follow-up PRs
for JSONFFIEngine functionality on mobile and other platforms.

* temporary hack for byoc

---------

Co-authored-by: Ruihang Lai <[email protected]>
Co-authored-by: Bohan Hou <[email protected]>
Co-authored-by: Yixin Dong <[email protected]>
Co-authored-by: Git bot <[email protected]>
Co-authored-by: Charlie Ruan <[email protected]>
Co-authored-by: Eric Lunderberg <[email protected]>
Co-authored-by: Shushi Hong <[email protected]>
Co-authored-by: Egor Churaev <[email protected]>
Co-authored-by: Siyuan Feng <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: Tianqi Chen <[email protected]>
Co-authored-by: Kartik Khandelwal <[email protected]>
Co-authored-by: Shrey Gupta <[email protected]>
Co-authored-by: Diego Cao <[email protected]>
Co-authored-by: David Pissarra <[email protected]>
Co-authored-by: Wuwei Lin <[email protected]>
Co-authored-by: Ricardo Lu <[email protected]>
Co-authored-by: Hongyi Jin <[email protected]>
Co-authored-by: Bohan Hou <[email protected]>
Co-authored-by: tqchen <[email protected]>
Co-authored-by: Rick Zhou <[email protected]>
Co-authored-by: Animesh Bohara <[email protected]>
Co-authored-by: Yong Wu <[email protected]>
Co-authored-by: Yong Wu <[email protected]>
Co-authored-by: Shrey Gupta <[email protected]>
Co-authored-by: Yaxing Cai <[email protected]>
Co-authored-by: ZCHNO <[email protected]>
Co-authored-by: Andrew <[email protected]>
Co-authored-by: na20215 <[email protected]>
Co-authored-by: Animesh Bohara <[email protected]>
Co-authored-by: Yogesh Garg <[email protected]>
Co-authored-by: Linyu Wu <[email protected]>
Co-authored-by: Yu Xuanchi <[email protected]>
Co-authored-by: Mengshiun Yu <[email protected]>
Co-authored-by: Jeethu Rao <[email protected]>
Co-authored-by: Xiyou Zhou <[email protected]>
sunggg added a commit to octoml/mlc-llm that referenced this pull request Apr 29, 2024
… April 29th 2024) (#265)

* [Serving][Grammar] BNF grammar simplifier and matcher (#1801)

* [Serving] LogProbs support (#1832)

This PR introduces the logprobs support with OpenAI API
compatibility. It enhances the sampler with a function to get
the top-probability tokens (supporting 5 tokens at most as of now).

To make it easy to pass logprob results back from serving engine
to frontend, we choose to pass logprob results in JSON string with
OpenAI API spec.

Unit tests are added to ensure the correctness of logprobs.
And the logprobs support also work with speculative decoding.

* [Serving] Support Mixtral in MLC Serve (#1840)

This PR supports Mixtral in MLC serve. The main change is introducing
the Mistral conversation template to the Python registry so that MLC
Serve can use it.

Besides that, this PR updates the KV cache capacity analysis to
make it more accurate in terms of usage calculation, while being
conservative since there is a known issue regarding batch-prefill
embedding taking which may lead to OOM. We will follow up
on the issue with a fix in the future and then enable the estimation
to use more GPU vRAM.

* [Fix] Fix `u_char` for Windows build (#1848)

Prior to this PR, `u_char` was used, which is not a standard
type in C++ and caused a Windows build failure.

This PR fixes it by using `unsigned char`.

* Auto updated submodule references

* [Fix] Add phi lm head name to is_final_fc, add q4f16_ft to CI (#1849)

[Fix] Add phi lm head name to is_final_fc

* [Build] Replace mod_transform_before_build with IRModule pass (#1852)

Instead of a python function that returns an updated `IRModule`, the
new `optimize_mod_pipeline` function returns a `tvm.ir.transform.Pass`
which can be applied to an `IRModule`.

* [SLM] Add support for InternLM architecture (#1835)

* Create __init__.py

* Add files via upload

* Update model.py

* Update model_preset.py

* Update conv_templates.cc

* Update internlm_loader.py

* Update internlm_quantization.py

* fix name of notes

* Update model.py

* Migration

* fix pylint issue

* fix pylint issue

* fix pylint error

* Update internlm_loader.py

* Update __init__.py

* Update __init__.py

* Delete python/mlc_chat/model/internlm/__init__.py

* Add files via upload

* [Bugfix] Handle model names with multiple path components (#1851)

Prior to this commit, a model name with multiple path
components (e.g. `dist/models/group_name/model_name`) would have
duplicated path components
(e.g. `dist/group_name/artifact_path/group_name/libname.so`).
This commit resolves the duplication.

* [KVCache] Add max num threads awareness to KVCache kernels (#1822)

* [KVCache] Add max num threads to KVCache kernels, fix WebGPU

* Read max_num_threads_per_block when available

* Change merge state in place kernel

* Make attention decode aware of max num threads, not just webgpu

Co-authored-by: Egor Churaev <[email protected]>

* Change util function name

---------

Co-authored-by: Egor Churaev <[email protected]>

* [KVCache] Migrate Baichuan model to PagedKVCache (#1854)

* [Python] Lazy import of transformers for tiktoken conversion (#1860)

This PR moves the import of transformers into the function body
of tiktoken tokenizer conversion, so we do not have a force dependency
on transformers.

* [SLM] RWKV5 World Support (#1787)

This PR adds RWKV5 support with RNNState, a similar interface as
PagedAttention.

Co-authored-by: Xiaoyu Zhang <[email protected]>

* [Serving] Register the ChatML conversation template (#1862)

Following #1854 , this pr registers the ChatML conversation template.

* [Utils][Transform] Added SetEntryFuncs transform (#1855)

Sets the entry functions for a module.  This utility is intended for
cases where a module contains several externally-exposed functions
and only one is desired for use.  (e.g. Separating out a
`transform_params` function from an `IRModule` that also contains
inference functions.)  This commit only updates the external
visibility, after which `relax.transform.DeadCodeElimination()` can be
applied.

* [Build] Update transform_params_for_each_rank to IRModule pass (#1856)

This allows it to be used as part of a optimization pipeline specified
as a `tvm.ir.transform.Sequential`.

* [Serving][Grammar] Integrate JSON grammar into the generation pipeline (#1867)

This PR is the 3rd part of the grammar-guided generation.
It integrates the grammar framework into the generation
process, and supports JSON output for now.

The API this PR provides is compatible with the OpenAI api.

### APIs
#### Python API
```
@dataclass
class ResponseFormat:
    type: Literal["text", "json_object"] = "text"
    json_schema: Optional[str] = None

@dataclass
class GenerationConfig:
        response_format: ResponseFormat = ResponseFormat(type="text")
```

#### Rest API
```
response_format: { "type": "text" } # text generation, by default
response_format: { "type": "json_object" } # json generation
response_format: { "type": "json_object", json_schema="..."} # json generation with schema
```

JSON generation with schema is not supported yet,
but is planned for the future.

### Performance
#### Without JSON
```
Single token prefill latency: 891.2234 ms/tok
Single token decode latency: 31.3399 ms/tok
Prefill token throughput: 4693.3077 tok/s
Decode token throughput: 226.4406 tok/s
Overall token throughput: 470.3180 tok/s
```
#### With JSON
```
Single token prefill latency: 219.2287 ms/tok
Single token decode latency: 29.1399 ms/tok
Prefill token throughput: 7392.1555 tok/s
Decode token throughput: 179.2296 tok/s
Overall token throughput: 1052.1996 tok/s
```

We observed a slight decrease in decode performance under JSON mode.
This will be further optimized in the future.

* [Serving] Support "n" for parallel generation (#1868)

This PR brings field `n` to generation config and thereby
supports parallel generation. This parallel generation effectively
leverages the "fork" functionality of paged KV cache.

This PR supports specifying the number of parallel generations
`n` in the standard OpenAI ChatCompletion API. This is the last
feature towards OpenAI API feature completeness.
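A hedged client-side sketch of using `n` (field names follow the standard OpenAI ChatCompletion convention; the model path and the mocked response shape are illustrative assumptions, not verified against this repo):

```python
# Request three parallel completions of the same prompt; the engine
# shares their common prefix via KV cache fork, as described above.
payload = {
    "model": "dist/Llama-2-7b-chat-hf-q4f16_1-MLC/",  # placeholder path
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "n": 3,  # number of parallel generations
}

# The response carries one choice per parallel generation, each with
# its own index (mocked here to show the expected shape):
response = {"choices": [{"index": i} for i in range(payload["n"])]}
indices = [choice["index"] for choice in response["choices"]]
```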

* [CI] Add retry to scm checkout (#1869)

Sometimes scm checkout can time out; this PR adds retry to that step.

* [Attn] Use float32 accumulation in attention kernel (#1870)

Prior to this PR, the TIR attention kernels did not cast matmul
operands to fp32 before multiplying.
For models like Phi-2, which may have large Q/K/V values (at the level
of a few hundred), the fp16 multiplication exceeds the range of
fp16 and sometimes leads to NaN attention results.

This PR fixes this issue.

* [Utils] Allow ReorderTransformFunc to be used without param manager (#1857)

Prior to this commit, the `ReorderTransformFunc` required several
components of the `ParamManager` to use.  The functionality it
provides, reordering dataflow blocks to minimize the liveset, is
useful outside of the context of the `ParamManager`.  This commit
makes the following changes, allowing it to be used independently of
the `ParamManager`.

- Generate the `pidx2binname` dictionary outside of `ReorderTransformFunc`

- Allow parameters to be separate `func.params`, rather than a single
  bundled tuple parameter.

* [SLM] Migrate Phi-2 to paged KV Cache #1871 (#1872)

This PR migrates Phi-2 to PagedKVCache attention, as a part of the model definition migration tracked in #1749.

Co-authored-by: Shrey Gupta <[email protected]>

* [Fix] Fix the use of "call_inplace_packed" and "call_pure_packed" (#1874)

The use of `call_inplace_packed` and `call_pure_packed` in the old
flow is outdated due to signature changes. This PR fixes the issue.

* [Fix] Add the missing BundleModelParams pass (#1875)

PR #1852 missed to apply the BundleModelParams pass and thus made
the compiled models not runnable through ChatModule (#1864). This PR
fixes the issue.

* [Docs] Update Android APK download link (#1876)

As pointed out by #1830, this PR fixes the Android app download
link in docs.

* Fix MLC-LLM website link weight convert not accessible (#1877)

Fix website link not accessible

* [Serving][Grammar] Support termination state in GrammarStateMatcher (#1884)

* [Serving] Make RequestState as a standalone object class (#1878)

This PR adopts suggestions from the support of OpenAI API parallel
generation `n` in #1868. The main update in this PR is to make
RequestState a standalone object class; it was previously a typedef
of `std::vector<RequestStateEntry>`.

This PR also fixes a bug in prefill that will cause engine failure
when `n` is large.

* [SLM] Update StableLM model and migrate it to paged KV Cache (#1882)

* [KVCache] Qwen 1.0 Model PagedKV Support (#1887)

Support Qwen1.0 Paged KV Cache

* [Serving] Estimate KV cache memory usage with metadata (#1888)

Prior to this PR, the serving engine memory usage estimation reads
model config for fields such as `num_key_value_heads`,
`num_hidden_layers`, etc. However, since not every model shares the
same set of config names (#1854), the estimation fails for models
that do not have this set of config field names.

This PR makes the following changes. First, it attaches these
field values into the model's metadata, in which way we unify the
field names for different models effectively. Then, when estimating
the memory usage, we read these fields from the metadata, rather than
model config, so we are safe for the name inconsistency.

* [KVCache] Migrate bigcode arch to PagedKVCache (#1891)

Compilation and runtime are smooth. Follow-up PRs will enable StarCoder2 support in the same model definition file.

* [Serving] Add Phi-2 conv template to mlc serve (#1890)

This PR adds the phi-2 model template to MLC serve.

For testing:
1. Start the server:
```shell
python -m mlc_chat.serve.server --model ./dist/phi-2-q4f16_1-MLC/ --model-lib-path ./dist/phi-2-q4f16_1-MLC/phi-2-q4f16_1-cuda.so --device auto --max-batch-size 2 --enable-tracing --host 127.0.0.1 --port 8000 --max-total-seq-length 8000
```
2. Send a request:
```shell
python test_server_rest_api.py
```

```python
# test_server_rest_api.py
import requests
import json

model = "./dist/phi-2-q4f16_1-MLC/"
port = 8000
payload = {
    "model": f"{model}",
    "messages": [{"role": "user", "content": "Tell me about Machine Learning in 200 words."}],
    "stream": False,
}
r = requests.post(f"http://127.0.0.1:{port}/v1/chat/completions", json=payload)
if r.status_code != 200:
    print(r.json())
else:
    print(r.json()["choices"][0]["message"]["content"])
```

* [Attn] Fix attention kernel for head dim not divisible by 32 (#1889)

Prior to this PR, our TIR prefill attention kernel assumes the
head dim to be a multiple of 32. As reported by #1826, this assumption
does not always hold.

This PR fixes this issue so that models with different head dim can
also compile.

* [Python] Enable "thrust" for CUDA by default (#1866)

This PR enables thrust for CUDA targets so that we can
dispatch some operators (e.g., cumsum) to thrust.

* [Serving] Fix loading presharded weights (#1894)

* [Serving] Address embedding lookup OOM issue (#1899)

This PR addresses the OOM issue that may be caused by embedding
lookup when the batch size of a prefill action is large.
Prior to this PR, a large embedding tensor would be created for
each sequence in the prefilled batch, which could take unexpectedly
large memory when the batch size is large.

* [Model] Remove redundant `batch_forward` and move broadcast (#1900)

This PR contains four changes:

1. It removes the duplicate `batch_forward` defined in model
definitions. This function was widely used prior to our migration
to PagedKVCache, since before migration the attention codepath
of single sequence forward and batch forward differ. But since our
migration, the codepaths are unified into one, and therefore we
can safely remove most `batch_forward` functions.

2. It moves `op.ccl_broadcast_from_worker0` from model main forward
(which will be called at the beginning of prefill/decode) to embedding.
This change has two benefits. Firstly, the token ids taken by `embed`
were not broadcast across workers, and it was possible for workers
other than 0 to have illegal token ids outside the range of the
vocab size; moving the broadcasting to `embed` addresses
this issue. Secondly, broadcasting token ids in `embed` is more
lightweight than broadcasting embeddings in `prefill`/`decode`, since
the tensor size of token ids is much smaller.

3. It adds `max_batch_size` to the config class of models, so that
they are potentially compatible with batching and MLC serve.

4. It removes the `k_cache` and `v_cache` effects from the models
that have switched to PagedKVCache support.

Randomly picked a few models (as below) to run the engine test, and
all of them are passed:

* phi-2 with tp=2,
* RedPajama with tp=2,
* stablelm with tp=1 (since stablelm does not support TP right now).

* [KVCache]Migrate Qwen2 model to PagedKVCache (#1903)

* [CI] Skip not supported quantization in model compilation test (#1904)

This PR updates the model compilation test so that it now skips
a quantization when the model does not support it.

* [Serving] Add missing header for `std::iota` (#1905)

The header `<numeric>` was missed, which may have caused build
failure on Windows. This PR adds the header.

* [Serving] Fix Model TokenEmbed function with TP (#1906)

This PR fixes a severe bug introduced by #1899.

Since #1899, we no longer copy the embedding back from worker 0
when using tensor parallelism. However, we did not synchronize
with worker 0.

This will cause the following issue: in batch prefill, we will
continuously call TokenEmbed for multiple times. Each time, we
will copy the token ids to the `token_ids` NDArray on worker 0.
If we do not synchronize with worker 0, then it is possible that
the local token ids have been updated for multiple times, before
the first `CopyToWorker0` really starts to execute on the worker 0
side. As a result, at the time of executing the token ids copy to
worker 0, the local token ids might be wrong (by "wrong", say we
are executing the copying of seq 0's token ids, then the actual
local token ids array might have already been seq 3's token ids).

As a result, the issue will cause the batch prefill behave completely
wrong. This PR adds a synchronization with worker 0 explicitly.

* [SLM] Add support for Orion architecture. (#1883)

This is a PR for supporting [OrionStarAI/Orion-14B-Chat](https://huggingface.co/OrionStarAI/Orion-14B-Chat).

* [Model] Eliminate the reshape in embedding func (#1908)

Prior to this PR, there was a trailing reshape kernel at the end of
the embedding func. The reshape does not need to be a separate
kernel, which consumed extra time during execution. This PR
eliminates the reshape in the embedding function by updating the
signature of the embedding func, so that now it only takes the plain
1D token ids as input.

* [Pass] Low batch GEMM using GEMV-like schedule (#1769)

When batch size is small, GEMM in MLP of decode stage can be
dispatched into a specialized GEMV-like schedule to improve efficiency.
GEMM with a dynamic var in the spatial axis will now be lowered into
```python
if dyn_var <= 8:
    low_batch_gemv()
else:
    normal_gemm()
```

* Auto updated submodule references

* [Serving] Avoid unnecessary worker sync in Model (#1909)

Following up #1906, this PR removes the synchronization given it is
avoidable. We use another approach to avoid the write-after-write
issue.

The key to addressing the issue is to make sure the addresses to be
copied to worker 0 are not rewritten before the copy actually happens.
So we pre-allocate a large host array to hold all the token ids,
and for each sequence, we copy its token ids to the offset given
when calling TokenEmbed, so that we can make sure an address will
not be written twice before copy happens.
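The idea can be sketched in plain Python (the buffer size and function name are illustrative; the real implementation manages a pre-allocated host NDArray on the C++ side):

```python
# One large pre-allocated host array holding all token ids of a batch;
# a plain list stands in for the pinned host buffer.
STAGING_CAPACITY = 4096  # illustrative capacity
staging = [0] * STAGING_CAPACITY
cursor = 0

def stage_token_ids(token_ids):
    """Copy one sequence's token ids to a fresh offset, so an earlier
    sequence's ids are never overwritten before the asynchronous copy
    to worker 0 has read them."""
    global cursor
    n = len(token_ids)
    offset = cursor
    staging[offset:offset + n] = token_ids
    cursor += n
    return offset, n
```

Each call to `stage_token_ids` returns a distinct offset, so no address is written twice before its copy happens.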

* [Serving][Grammar] Enhance GrammarStateMatcher to support general grammar (#1917)

* [Android] Improve perf of TIR PagedAttn kernel on Android (#1915)

* android perf

* Update kv_cache.py

* Deprecate old flow (#1928)

* Deprecate old flow

This PR deprecates the old flow.
As of today most of the efforts are centralized around the new flow
with SLM compilation. Additionally, we are bringing model definitions
through a unified KV interface so we can have a single model definition
across all backends, server and local settings.

We kept the old flow around for a while, but it is a good
time to do the transition. All the documents are updated
to point to the new flow.

We also created a backup branch
https://github.com/mlc-ai/mlc-llm/tree/backup-before-old-flow-deprecation
for people who would like to checkout some of the old flow references.

* Remove deprecated prebuilts

* [Serving] Register the StableLM3B conversation template (#1920)

Update conversation_template.py

* Remove deprecated build.py

* [Fix] KVCache creation with call_pure_packed (#1930)

With https://github.com/apache/tvm/pull/16684 merged in, the KV
cache creation will fail when compiling models. This PR fixes the
problem by using `call_pure_packed`.

* [KVCache] Update FlashInfer PackedFunc names (#1931)

This PR updates the FlashInfer names given
https://github.com/apache/tvm/pull/16692 has been merged.

* [REFACTOR] remove tests/legacy-python (#1933)

This PR removes the folder tests/legacy-python
as a followup cleanup step of the old flow

Some of the files, like the lib comparison utilities, are useful,
and we should recover them later in the mlc_llm.testing.DebugChat flow.

* [REFACTOR] rename mlc_chat => mlc_llm (#1932)

This PR renames the mlc_chat package to the mlc_llm package
now that this is the new official flow. We also update the necessary
locations that might touch the package.

* Auto updated submodule references

* [Docs] Deprecating CUDA 11.7/11.8 support (#1939)

We have deprecated the wheel support for CUDA 11.7/11.8 due to TVM
thrust compatibility with old CUDA versions.

* [Fix] Fix KV cache call in mistral (#1938)

The latest TVM introduces the wellformedness check of the IR.
The mistral model definition breaks the wellformedness due to the
purity. This PR fixes this issue.

* [ChatModule] Remove eos_token_ids (#1940)

This PR removes the eos_token_ids from the ChatModule given it is
actually not used anywhere.

* [SLM] Weight conversion with generator (#1916)

This PR enhances weight conversion so that it passes a generator
to `tvmjs.dump_ndarray_cache`. This effectively reduces the CPU
memory pressure when converting weights, especially when the total
converted weight size is close to or larger than the CPU memory size.

* [Serve] Introducing GPU sampler for CUDA (#1934)

This PR introduces the GPU sampler for CUDA only. The GPU sampler
makes use of the GPU sampling ops introduced in apache/tvm#16575.

We will follow up to benchmark the performance of the GPU sampler
over CPU sampler.

* [Serve] Constrain KV cache capacity on Metal (#1943)

This PR constrains the KV cache capacity for Metal devices to 32768,
in order to avoid large tensors in the KV cache. This is because right
now the Metal runtime has a performance issue when running a kernel
in which some input buffer is very large, even if little of the large
buffer is accessed in the kernel.

* [CI] Add windows ci (#1942)

This PR adds windows CI.

* Auto updated submodule references

* [Fix] Fix embedding shape check in ChatModule (#1953)

This PR is a fix to address #1952.

* [Fix] Fetching the Git-LFS tokenizer files (#1954)

Prior to this PR, when running commands like
```shell
python3 -m mlc_chat chat HF://mlc-ai/gemma-7b-it-q4f16_2-MLC
```
only the binary weight files are downloaded, among all the Git LFS
files.

For models like Gemma, whose tokenizer is large and also stored as a
Git LFS file, the tokenizer files are not downloaded automatically.
For example, the cloned Gemma `tokenizer.json` file has content
```
version https://git-lfs.github.com/spec/v1
oid sha256:05e97791a5e007260de1db7e1692e53150e08cea481e2bf25435553380c147ee
size 17477929
```
and this pointer content is never resolved into the actual tokenizer.
This leads to the issue of #1913.

This PR fixes the issue by pulling all the Git LFS files that are not
binary files.
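The unrealized files above follow the Git LFS pointer format, so detecting a pointer that was cloned in place of the real file can be as simple as the following sketch (the function name is illustrative, not part of the repo):

```python
def is_lfs_pointer(text: str) -> bool:
    """A Git LFS pointer file starts with a fixed spec line and then
    lists `oid`/`size` keys instead of the real file content."""
    return text.startswith("version https://git-lfs.github.com/spec/v1")

# The Gemma tokenizer.json content quoted above is exactly such a pointer:
pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:05e97791a5e007260de1db7e1692e53150e08cea481e2bf25435553380c147ee\n"
    "size 17477929\n"
)
```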

* [LogitProcessor] Add max thread awareness to logit processing kernels (#1955)

Make the kernels in `AttachLogitProcessFunc` to be aware of maximum
threads, fixing https://github.com/mlc-ai/mlc-llm/issues/1951.

Most code change is due to indentation, the main change is
changing `1024` to `tx`, where `tx` is
```
tx = 1024  # default
max_num_threads_per_block = get_max_num_threads_per_block(target)
if max_num_threads_per_block < tx:
    tx = max_num_threads_per_block
check_thread_limits(target, bdx=tx, bdy=1, bdz=1, gdz=1)
```

* [Model] Use static hidden size in mixtral scatter_output (#1959)

* Auto updated submodule references

* [CompilerFlag] Detect if FlashInfer is enabled from libinfo (#1941)

This PR supports the detection of if FlashInfer is enabled when
building TVM, so that FlashInfer won't be enabled when TVM is
not built with FlashInfer enabled.

* [Serving][Grammar] Add grammar termination as a stop condition (#1964)

* Unify schema for conversation template and embed into mlc-chat-config.json (#1965)

* [SLM] Small correction on Stablelm and Qwen2. (#1958)

* small fix

* small fix

* Update stablelm_model.py

* [Serving][Fix] Fix JSON output check in test_server.py (#1966)

`test_server::is_json_or_json_prefix` is used to check that the output
is JSON or a prefix of JSON.

It uses `json.loads` internally. However, `json.loads` (i.e. JSON
decoding) is token-based rather than character-based. If half a token
is left at the end of the string, the string cannot be matched.

This PR adds another check for the remaining "half a token" if it
exists.
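A simplified sketch of that kind of check (illustrative, not the exact `test_server` implementation; it only handles half keyword tokens, not half strings or numbers):

```python
import json

_KEYWORDS = ("true", "false", "null")

def is_json_or_json_prefix(s: str) -> bool:
    """Accept `s` if it parses as JSON, if the parser consumed the
    whole string before failing (a truncated structure), or if the
    unparsed tail is a prefix of a JSON keyword token."""
    try:
        json.loads(s)
        return True
    except json.JSONDecodeError as e:
        rest = s[e.pos:]
        if not rest:
            # e.g. '{"a": 1' fails only because the input ran out
            return True
        # e.g. '{"a": tr' leaves the half token 'tr' unparsed
        return any(k.startswith(rest) for k in _KEYWORDS)
```

The `e.pos` attribute of `JSONDecodeError` gives the character position where decoding failed, which distinguishes truncation at the end from malformed content in the middle.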

* [Model] Migrate Mistral to use PagedKVCache (#1967)

This PR migrates the mistral model to the PagedKVCache interface
which supports sliding window attention with paged attention kernel
written in TensorIR.

We thereby introduce a `support_sliding_window` mode for KV cache,
which leaves space for supporting sliding window for any model at
runtime.

This PR tests Mistral with both chat and serve.
The chat performance of Mistral 7B improves over before, benefiting
from the paged attention implementation.

* Auto updated submodule references

* [REST] Update Rest API docs for the latest serve flow (#1972)

* [Docs][Upd] Server launch, examples for endpoints for MLC Serve

* remove v1/completions

* add api docs to rest

---------

Co-authored-by: Shrey Gupta <[email protected]>

* [Conv] Add bos_token to llama and mistral in ConvTemplateRegistry (#1970)

Since we don't have the `add_bos` field in the new Conversation
template, we should add the bos token into the
system_prefix_token_ids, so that it will be added to the
tokenized prompt.
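The effect can be sketched as follows (the bos id and token ids are illustrative; llama-2's `<s>` token is conventionally id 1):

```python
LLAMA_BOS_TOKEN_ID = 1  # assumed id of the <s> token

def build_prompt_token_ids(system_prefix_token_ids, body_token_ids):
    """Without an `add_bos` flag on the template, the bos token reaches
    the tokenized prompt only if it is part of system_prefix_token_ids."""
    return list(system_prefix_token_ids) + list(body_token_ids)

prompt = build_prompt_token_ids([LLAMA_BOS_TOKEN_ID], [306, 626])
```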

* [Model][Serve] Add support for LLaVa model in serving engine (#1974)

This PR adds support for LLaVa-v1.5 model on the serving engine.
Use the HF weights and config from https://huggingface.co/llava-hf/llava-1.5-7b-hf.

Passing image input is supported as url (reference: https://platform.openai.com/docs/guides/vision)
Example:

```python
data = {
    "model": "dist/llava-1.5-7b-hf-q4f16_1-MLC/params/",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": "https://llava-vl.github.io/static/images/view.jpg",
                },
                {"type": "text", "text": "What does this image represent?"},
            ],
        }
    ]
}
response = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=data)
print("Response body:", response.text)
```

* [Serve] Hot fix for the mixtral serving (#1975)

[Fix] hotfix for the mixtral serving

Co-authored-by: Yong Wu <[email protected]>

* [REST] REST API Deprecated (#1973)

Deleted old Rest API

- Removed rest.py
- Removed old interface/openai_api.py
- Update ChatModule to use new OpenAI Api protocol

Co-authored-by: Kartik Khandelwal <[email protected]>

* [Fix] Fix handling of non-numerical cuda arch (#1976)

On the latest GPUs, the CUDA arch may not be an integer, e.g. `sm_90a`.
This fixes a few places that rely on integer parsing.

* [Serving][Grammar] Support specifying the main rule in grammar (#1982)

* [Fix] Fix `MLC_MULTI_ARCH` with arch `sm_90a` (#1984)

This PR fixes the missing patch for targets with the `sm_90a` arch, as a follow-up to #1976.

* Fix Llama-2 and Mistral conversation template. Update ConvTemplateRegistry (#1981)

The current prompt format for Llama-2 and Mistral is not
completely correct.

This PR updates the code to strictly follow the official prompt
format for the two models. Also adds in missing conv templates
to ConvTemplateRegistry.

* [SpecDecode] Fix sampler selection. (#1971)

This PR temporarily fixes sampler selection logic for speculative
decoding. As GPU sampler support for speculative decoding is
not ready, speculative decoding will use cpu sampler.

* [Serving][Grammar] Utility to convert json schema to EBNF grammar (#1983)

This PR adds a generic utility to convert a JSON schema, especially one generated from pydantic, to an EBNF grammar. This helps grammar-guided generation when we provide a JSON schema as the restriction.

This converter supports the standard JSON indentation style in the output grammar.

API:
```
def json_schema_to_ebnf(
    json_schema: str,
    *,
    indent: Optional[int] = None,
    separators: Optional[Tuple[str, str]] = None,
    strict_mode: bool = True,
) -> str:
    """Convert JSON schema string to EBNF grammar string.

    Parameters
    ----------
    json_schema : str
        The JSON schema string.

    indent : Optional[int]
        The number of spaces for each indent. If it is None, there will be no indent or newline.
        The indent and separators parameters follow the same convention as
        `json.dumps()`.

    separators : Optional[Tuple[str, str]]
        The separator between different elements in json. Examples include "," and ", ".

    strict_mode : bool
        Whether to use strict mode. In strict mode, the generated grammar will not allow
        unevaluatedProperties and unevaluatedItems, i.e. these will be set to false by default.
        This helps LLM to generate accurate output in the grammar-guided generation with JSON
        schema.
    """
    pass
```

* Auto updated submodule references

* [Fix] Fix serve model to adapt the latest Allocator signature (#1989)

PR apache/tvm#16738 updated the Allocator signature. This PR
updates the caller side accordingly.

* [Model] Use optimized group gemm for Mixtral (#1988)

* [Attn] Fix the construction of attn result merge kernel (#1995)

This PR fixes the mistake of passing wrong number of heads
to the attention result merge kernel.

* [iOS][Android] Add validation of library file for iOS and Android build (#1993)

This PR adds validation of symbols in iOS and android build.
During static library build, we need the right model_lib
for us to point to the packaged model executables.

Not doing so correctly will result in a "vm_load_executable not found"
error, which is not informative.

In this PR we validate the compiled model lib by dumping the global
symbols and ensuring the list of model libs matches.

In the future we should perhaps lift the validation to the mlc_llm package.

* Auto updated submodule references

* [Serve] add allocator in Storage as the upstream change (#1997)

The changes in https://github.com/apache/tvm/pull/16750
modified the signature of the Storage class. This pull request updates
the caller code in mlc-llm to accommodate the new signature; without
the change we ran into a build error.

* [Compiler] Support IPC memory and customized all-reduce kernels (#1990)

This PR introduces the IPC memory and customized all-reduce kernel
dispatches for tensor parallelism. We add a new compiler flag
`--allreduce-strategy`, which supports `"ring"`, `"one-shot"` and
`"two-shot"`. The flag defaults to `"ring"`, which means this PR
makes no difference if people do not manually change the all-reduce
strategy.

As of now the IPC-memory-backed customized all-reduce kernels are
only available on CUDA.

To enable all-reduce strategies other than "ring", here are some
example compile commands:
```python
python -m mlc_llm compile model/mlc-chat-config.json --device cuda --opt "allreduce-strategy=one-shot" -o model/lib.so
python -m mlc_llm compile model/mlc-chat-config.json --device cuda --opt "allreduce-strategy=two-shot" -o model/lib.so
```

Please be aware that you probably also need to specify other
compiler flags, for example `--opt "cublas_gemm=1;allreduce-strategy=one-shot"`.

* Auto updated submodule references

* [Model] Fix the top-k TIR script for well-formedness (#2002)

This PR fixes the malformed MoE TIR scripts.

* Fix invalid use of dataflow var in sampler output (#2003)

* [Fix] Fix KV cache creation pass after nn.Module changes (#2011)

This PR corrects the assertion after latest changes in apache/tvm
that updates some nn.Module behavior.

* [iOS] Fix typo in prepare_model_lib.py (#2013)

Fix typo in prepare_model_lib.py

tar_list.append(valid_paths[ls0]) is introduced by mistake in https://github.com/mlc-ai/mlc-llm/pull/1993

* Remove unstable assertion in KV cache creation dispatch (#2017)

This particular assertion has been unstable recently, given the back-and-forth behavior of the upstream TVM nn.Module exporter.

* Auto updated submodule references

* [SLM] Qwen2 Multi-GPU support (#1985)

* Update qwen2_model.py

* fix lint issue

* fix lint issue

* fix lint issue

* more info for preshard  (#2027)

* Previously, when the pre-sharded version of a model was not available, the program silently fell back to the normal workflow without issuing any alert. Now, when someone attempts to convert to a pre-sharded model but cannot, the program emits a warning to inform users that it will revert to the standard model conversion process.

* format fix.

* black reformatted, i did not see any diff.

* black reformatted..

* Register stablelm-2 conversation template (#2029)

* [Serving][Fix] Fix problems in PopenServer (#2032)

This PR fixes several problems in the PopenServer:

- Add a check for the case where the server has not started and the
request returns a failure code, e.g. 502. Also changed the retry
interval to 0.1s.

- Add `__enter__` and `__exit__` methods to PopenServer.
When the program is interrupted, using a with clause (`__enter__`
and `__exit__`) ensures the server always terminates. When using
`start()` and `terminate()`, the server may keep running in the
background even after the parent process ends.
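The context-manager pattern can be sketched as below (a minimal sketch; the real PopenServer launches the mlc serve subprocess, so the command is only an illustration):

```python
import subprocess

class PopenServer:
    """Minimal sketch of a subprocess wrapper whose __exit__ guarantees
    the child is terminated even if the with-body raises."""

    def __init__(self, cmd):
        self.cmd = cmd
        self.proc = None

    def start(self):
        self.proc = subprocess.Popen(self.cmd)

    def terminate(self):
        if self.proc is not None:
            self.proc.terminate()
            self.proc.wait()

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs even when the body raises or the program is interrupted,
        # so the server subprocess is never left running in the background.
        self.terminate()
        return False
```

Usage: `with PopenServer([...]) as server: ...` replaces manual `start()`/`terminate()` pairs.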

* [Quantization] Skip MoE gate layer (#2012)

This PR skips quantizing the MoE gate layer.
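A hedged sketch of such a filter (the exact name-matching rule used in MLC is an assumption here; the gate output feeds top-k expert routing, which is sensitive to quantization error):

```python
def should_quantize(param_name: str) -> bool:
    """Keep MoE router/gate linear layers in full precision and
    quantize everything else, matching on the dotted parameter path."""
    return "gate" not in param_name.split(".")

# Hypothetical parameter names for illustration:
layers = [
    "model.layers.0.moe.gate.weight",
    "model.layers.0.moe.e1_e3.weight",
]
quantized = [name for name in layers if should_quantize(name)]
```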

* [Serving][Grammar] Integration of JSON schema generation (#2030)

Previous PR #1983 introduced a transformation from json schema
to BNF grammar.

This PR further integrates the grammar from json schema to the
generation pipeline, so that the engine now supports json schema
output. GrammarStateInitContexts are stored in a cache, so they will
not be created again for the same schema.

Interface:

- Python
```
@dataclass
class ResponseFormat:
    type: Literal["text", "json_object"] = "text"
    schema: Optional[str] = None
```

- Rest API
```
class RequestResponseFormat(BaseModel):
    type: Literal["text", "json_object"] = "text"
    json_schema: Optional[str] = Field(default=None, alias="schema")

class CompletionRequest(BaseModel):
    ...
    response_format: RequestResponseFormat = Field(default_factory=RequestResponseFormat)

class ChatCompletionRequest(BaseModel):
    ...
    response_format: RequestResponseFormat = Field(default_factory=RequestResponseFormat)
```

Performance:

We only test single-batch performance for now to show the latency overhead.

- Model: `Llama-2-7b-chat-hf-q4f16_1`
- GPU: `NVIDIA GeForce RTX 3080`
- CPU: `AMD Ryzen 9 5900X 12-Core Processor`

```
JSON ON Batch=1
Average prefill tokens: 651.0000 tok/req
Average decode tokens: 499.0000 tok/req
Single token prefill latency: 0.3140 ms/tok
Single token decode latency: 8.6831 ms/tok
Prefill token throughput: 3184.8002 tok/s
Decode token throughput: 116.6039 tok/s

JSON OFF Batch=1
Average prefill tokens: 651.0000 tok/req
Average decode tokens: 499.0000 tok/req
Single token prefill latency: 0.3098 ms/tok
Single token decode latency: 8.6823 ms/tok
Prefill token throughput: 3227.8141 tok/s
Decode token throughput: 116.9251 tok/s
```

This PR also does these bug fixes / changes:
- Changed the structure of the grammar converted from the schema
to avoid a large number of uncertain tokens, which caused a
performance degradation

* [Compiler] Support AUTO mode for all-reduce strategy (#2034)

This PR supports the auto mode for IPC all-reduce strategy.
It renames the strategy from `allreduce-strategy` to
`ipc-allreduce-strategy` in the compiler optimization flags. The
default RING mode is renamed to NONE mode, which, when specified,
uses nccl all-reduce without any IPC memory rewrite.

So right now to enable IPC all-reduce, the ideal way is to do
`ipc-allreduce-strategy=auto`.

* [LLaVa] Follow-up for TODOs in LLaVa model (#2010)

LLaVa follow-ups:
1. Added base64 image support.
2. Merged `as_prompt` and `as_prompt_list`.
3. `get_image_from_url` now uses the config.

* [Pipeline] Defer GPU IPC memory lowering (#2038)

This PR moves the position of GPU IPC memory lowering pass in pipeline,
so that it applies after the CUDA graph rewrite to enable CUDA graph
with the customized all-reduce kernels.

* [Model] Add missing broadcast of logit_position for multigpu (#2040)

This commit adds the broadcasting of `logit_pos` in batch prefill
for all models to avoid the logit position out-of-bound issue.

* [Preshard] apply presharding after quantization (#2039)

This changes the behavior of presharding by applying presharding
after quantization. This makes the behavior consistent with and
without presharding.

* [SLM] Baichuan Multi-GPU support (#2037)

This PR enables TP function of Baichuan2 model.

* Auto updated submodule references

* [Model] Skip TVMSynchronize when tracing is not enabled (#2041)

This PR removes the synchronization in `Model` when Chrome tracing
is not enabled. This can help some logit processing kernels launch
earlier.

* [Serving] Support NVTX for benchmarking (#2043)

This PR adds NVTX support to MLC serve, which helps in analyzing
benchmarking results.

**Note.** To enable NVTX, please add `set(USE_NVTX ON)` to file
`build/config.cmake`.

* Update huggingface_loader.py

* [Serve] Separate callback invocation to another thread in AsyncEngine (#2046)

This PR enhances the AsyncThreadEngine by moving the callback
invocation to a separate thread, in order to reduce the CPU overhead
of invoking Python callbacks.
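The pattern can be sketched with a queue and a worker thread (names are illustrative, not the actual AsyncThreadEngine API):

```python
import queue
import threading

class CallbackWorker:
    """Run a (potentially slow) Python callback on its own thread, so
    the engine loop only pays for an enqueue per output."""

    _STOP = object()  # sentinel to shut the worker down

    def __init__(self, callback):
        self._callback = callback
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            self._callback(item)

    def push(self, item):
        """Called from the engine thread; returns immediately."""
        self._queue.put(item)

    def close(self):
        """Drain remaining items, then stop the worker thread."""
        self._queue.put(self._STOP)
        self._thread.join()
```

Items are processed in FIFO order, so streamed outputs still reach the callback in the order the engine produced them.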

* [LLaVa] Fix random token output after first sentence (#2048)

Fix Llava random token after first '.' token

Co-authored-by: Animesh Bohara <[email protected]>

* Auto updated submodule references

* [Pass] Fix LiftGlobalBufferAlloc for proper GlobalVar struct info (#2053)

This PR fixes the GlobalVar struct info mismatch issue caused by
the LiftGlobalBufferAlloc pass after a recent TVM commit.

* Auto updated submodule references

* [Serving] CLI Support for SERVE (#2014)

This PR adds CLI support for serve.

Usage:

`mlc_llm serve [Model]`

refer `mlc_llm serve -h` for more options

Comments
- Supports JIT compilation of Model lib
- Added context manager to `ServerContext` class

Co-authored-by: Ruihang Lai <[email protected]>
Co-authored-by: Shrey Gupta <[email protected]>

* [Pipeline] Insert hints to enable cuda graph symbolic capture (#2050)

* [Pipeline] Add pass to insert hints to enable cuda graph symbolic capture

* [Loader] Print message when multi-GPU loader is finished (#2051)

* [Loader] Print message when multi-GPU loader is finished

* Update multi_gpu_loader.cc

* fix

* [KVCache] Support matching arbitrary element offset for aux data (#2057)

This PR enhances the TIR attention-related functions to support
matching arbitrary element offsets. This makes room for the KV cache
to allocate one large array for all the auxiliary data and slice
into it.

This PR should affect nothing in the current codebase, given all
the element offsets are zeros as of now.

* [Serving] Support copy stream in LogitProcessor and GPUSampler (#2058)

This PR introduces copy stream to LogitProcessor and GPUSampler
for CUDA, so that auxiliary data can be copied on a separate stream
and overlap with the computation time.

* [SLM] Stablelm Multi-GPU support (#2052)

This PR enables TP function of Stablelm model.

* [KVCache] Introducing single page copy func for KV cache fork (#2060)

This PR introduces the single page copy TIR function for KV cache.
This function is helpful for sequence fork at specified positions.

NOTE: this PR is a breaking change, so you will need to re-compile
your model and update TVM or the MLC-AI pip package to the latest.

Related PR: apache/tvm#16813

Co-authored-by: Yaxing Cai <[email protected]>

* [Python] Implement testing.DebugChat for end-to-end model debugging (#2056)

* [Docs] Fix docs for python server and rest call (#2066)

This PR updates the MLC serve documentation for server launching.

* [CI] Enable submodule clone for WASM model compilation (#2068)

The incoming WASM runtime requires 3rdparty for builds. This PR enables
the submodule clone for WASM model compilation in CI.

* [Serve] Fork sequence at specified positions (#2067)

With PagedKVCache supporting fork at a specified position, this PR
updates the `Model` interface accordingly. The fork position defaults
to -1, which means the last position.
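The -1-as-last-position convention can be sketched as follows. This is an illustrative helper (`fork_sequence` is a hypothetical name, not the MLC API), operating on a plain token list where the real implementation copies KV cache pages:

```python
def fork_sequence(parent_tokens, fork_pos=-1):
    """Fork a sequence by copying its prefix up to fork_pos.
    fork_pos == -1 means fork at the last position (copy everything).
    Illustrative sketch only; the real fork is a KV-cache page copy."""
    n = len(parent_tokens)
    if fork_pos == -1:
        fork_pos = n
    assert 0 <= fork_pos <= n, "fork position out of range"
    # The child starts with a copy of the parent's prefix.
    return list(parent_tokens[:fork_pos])
```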

* [SLM] Add support for RWKV6 model  (#1977)

* [SLM]: Support for rwkv tokenizer

* [SLM] RWKV6 World Support

* [Quantization] Reorganize utils code in group_quantization (#2055)

* [Serving] Bugfix for empty stop string  (#2070)

add check for empty stop string; fix Vanilla LM conversation template

* [SLM] Internlm Multi-GPU support (#2072)

This PR enables tensor parallelism support for InternLM model.

* [WebGPU] Add mlc wasm runtime, support grammar in web (#2061)

* [WebGPU] Add mlc wasm runtime, support grammar in web

* Make in web for wasm ci

* Fix wasm ci

* Fix wasm ci

* Change export library arg name

* Move macro to cc instead of makefile

* [Build] Use TVM_HOME environment variable (#2073)

Prior to this commit, the `CMakeLists.txt` file checked a cmake
`TVM_HOME` variable, but did not check the usual `TVM_HOME`
environment variable.  If this variable is set, it should be used.

* [Serving] Support input chunking (#2069)

This PR supports input chunking with regard to the customized
"prefill chunk size" (field `prefill_chunk_size` in
`mlc-chat-config.json`). With this PR, we can now chunk a long input
into multiple parts when there is an upper limit on the prefill chunk
size. Only `TokenData` is supported for now.
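The chunking logic described above amounts to splitting the token sequence into pieces no longer than the limit. A minimal sketch (not the engine's actual implementation):

```python
def chunk_input(tokens, prefill_chunk_size):
    """Split a long token sequence into chunks no longer than
    prefill_chunk_size, so each chunk can be prefilled separately."""
    assert prefill_chunk_size > 0
    return [
        tokens[i : i + prefill_chunk_size]
        for i in range(0, len(tokens), prefill_chunk_size)
    ]
```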

* [Docs] API Code Completion Guide (#2054)

* Allow "mlc_llm --host" option to override host triple the model compi… (#2074)

Allow "mlc_llm --host" option to override host triple the model compile to

* [Web] Move prep emcc deps script to web folder (#2077)

* [SLM] Qwen Multi-GPU support (#2075)

* Fix mismatch of metadata func and global symbol (#2078)

* Fix mismatch of metadata func and global symbol

* Update estimate_memory_usage.py

* [Disco] Set worker CPU affinity with env variable (#2042)

This PR enables setting the CPU affinity of disco workers in
MLC, following the support in apache/tvm#16807. The purpose is
to try to reduce the CPU core switch overhead brought to disco workers,
which may cause extra bubble time in disco workers before/during
tasks.

We use a macro `MLC_DISCO_WORKER_CPU_BINDING` to specify the CPU
affinities of workers. This is by default not used. To enable it,
you can run the command like

```shell
MLC_DISCO_WORKER_CPU_BINDING=64,65,66,67 python some_mlc_app.py
```

to specify the four CPU core ids for the four workers.
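The environment variable holds a comma-separated list of core ids, one per worker. A sketch of the parsing logic (the helper name is hypothetical; the actual parsing happens in C++):

```python
import os

def get_worker_cpu_binding(num_workers):
    """Parse MLC_DISCO_WORKER_CPU_BINDING into a list of CPU core ids,
    one per worker. Returns None when the variable is unset."""
    raw = os.environ.get("MLC_DISCO_WORKER_CPU_BINDING")
    if raw is None:
        return None
    core_ids = [int(s) for s in raw.split(",")]
    assert len(core_ids) >= num_workers, "need one core id per worker"
    return core_ids[:num_workers]
```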

* [Quantization] Introduce PerTensor and F8 quantization (#2079)

* [Quantization] Introduce PerTensor and F8 quantization

* address comments

* [Serving][Refactor] Rename AsyncThreadedEngine to ThreadedEngine (#2081)

This PR renames the AsyncThreadedEngine to ThreadedEngine to
prepare for follow up refactors of Python interface. Meanwhile,
this PR exposes a creation function for AsyncThreadedEngine so that
it can be further used by others, such as JSONFFIEngine.

* [Serving] Add cuda profiling in benchmark test (#2084)

* [Serving] Add cuda profiling in benchmark test

* [Grammar] Fix broken grammar tests (#2083)

This PR fixes some grammar parser tests that were broken.

* [Serving][Fix] Fix chunked prefill condition (#2082)

This PR fixes a bug when trying to chunk an input and do prefill.
The stats prior ot this PR was wrong.

* [Conversation] Fix RedPajama conversation template (#2087)

As reported and discussed in #2086, this PR fixes the RedPajama
template.

* [Serving][Refactor] Python interface refactor (#2085)

This PR is an initial major Python interface refactor of MLC Serve.

With this PR, `mlc_llm.serve` in Python now exposes two engine classes:
`AsyncEngine` and `Engine`. Both classes have two entrypoints,
`chat_completion` and `completion`, which conform to the OpenAI Python API
(reference: https://github.com/openai/openai-python).

As the name suggests, `AsyncEngine` works asynchronously, and `Engine`
works synchronously. It is worth noting that the `Engine` introduced in
this PR is different from the previous `Engine`: the new `Engine` does
not provide interfaces for batch generation.

For robustness and correctness, the old `Engine` in Python is moved
to `mlc_llm.serve.sync_engine.SyncEngine`. We do not directly expose
this SyncEngine, and it now mainly serves testing and debug purposes.
It is useful to check the correctness of new features, because of its
simplicity. It keeps the low-level interface to directly invoke `step()`
function of the engine, and also keeps the low-level batch generation
interface.

Our REST API entry points defined under `mlc_llm/serve/entrypoints/`
are also refactored accordingly to adapt to the latest Python API
in MLC Serve. In short, most of the logic in OpenAI API entry points
are moved to Python API, which simplifies the implementation of
entry points.

Please note that this is the first (also the largest) planned refactor.
We will follow up with some other refactors, which have smaller scopes
compared with this PR. The planned refactors include:

* provide submodule interface to align OpenAI Python package in
https://github.com/openai/openai-python
* refactor the constructor interface of `Engine`/`AsyncEngine` to
align the MLC serve CLI interface.

* [Serving] Separating ThreadedEngine creation and initialization (#2090)

This PR separates the creation and initialization of ThreadedEngine
for multi-threading use cases. So we can make sure that the
ThreadedEngine instance is created before any other operations
(such as initialization, running background loop, etc.).

* [Serving] Enhance robustness with small KV capacity (#2091)

This PR enhances the robustness of the engine, which previously had
issues when the KV capacity is small.

* [REST] Update REST API docs (#2092)

This updates the rest docs to use `mlc_llm serve` and also adds a quick start section.

* [DOCS] Clarify vulkan loader dependency (#2095)

This PR clarifies the vulkan loader dependency.
Some systems may not have the right vulkan loader,
and we need to install it via conda.

* [SLM] Add support for Chatglm3 architecture (#2096)

This PR enables the ChatGLM3 model.

* [Quantization] Add OpenCL device (#2097)

This PR adds OpenCL device for weight conversion.

* [Serving] Support stream=True for Python API (#2098)

The previous refactoring PR formalizes the MLC serve Python API
but does not respect the `stream` flag properly: no matter if
`stream` is True or False, the functions always work in a streaming
style. This PR supports the non-stream case.
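The relationship between the two cases can be sketched as a non-stream entry point built on top of a streaming one: when `stream=False`, drain the stream and return the concatenated result. This is an illustrative sketch with stand-in functions, not the MLC implementation:

```python
def generate_stream(prompt):
    """Stand-in for a streaming generation API (hypothetical)."""
    for piece in ["Hello", ", ", "world", "!"]:
        yield piece

def generate(prompt, stream=False):
    """When stream=True, hand the caller the chunk iterator directly;
    when stream=False, consume it and return the full text."""
    chunks = generate_stream(prompt)
    if stream:
        return chunks
    return "".join(chunks)
```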

* [Serving][Refactor] OpenAI API Python interface alignment (#2099)

This PR aligns the Python API of chat completions and completions MLC
serve with the OpenAI Python package https://github.com/openai/openai-python.

Specifically, say we first create an engine or async engine, then
we can use entrance `engine.chat.completions.create(...)` for chat
completions.

We will add more use examples in the codebase after another few
refactors.

* [DOC] fix small python env install error (#2102)

Fixed a slight issue with the TVM install instructions: one needs to
specify python=3.11 on some platforms, otherwise a "python not found"
error may occur.

* [JSONFFIEngine] Initial implementation of JSONFFIEngine (#2101)

This PR introduces initial support for the JSONFFIEngine.
The request is supposed to be a JSON string in the
[Chat completion request body format](https://platform.openai.com/docs/api-reference/chat/create).
The output (input to the callback function provided) is a list of
JSON strings in the [Chat completion chunk object format](https://platform.openai.com/docs/api-reference/chat/streaming).

Remaining functionality will be added in follow-up PRs.
1. Support for other input datatypes (image, etc.)
2. Applying conversation template to input
3. Function calling and tools support
4. Generation config parameters support
5. Independent text streamers for each request
6. logprobs support

---

Co-authored-by: Ruihang Lai <[email protected]>

* [Model] Use tanh approximation of GeLU in Gemma MLP (#2106)

This is in line with the implementation in the [transformers](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gemma/modeling_gemma.py#L183) library.
Also, the [gemma-1.1](https://huggingface.co/google/gemma-1.1-2b-it/blob/main/config.json#L10) model config.
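The tanh approximation referenced above is the standard formula 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))). A minimal sketch for scalar inputs (the model applies it elementwise to tensors):

```python
import math

def gelu_tanh(x):
    """Tanh approximation of GeLU:
    0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x**3)))."""
    return 0.5 * x * (
        1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3))
    )
```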

* Auto updated submodule references

* [Quantization] Stricter checks for MoE gate (#2109)

This PR strengthens the MoE gate checks to include checking the number
of experts, given that the real MoE gate router layer's output feature
number is the number of experts and is usually very small.

This PR comes from a regression: a layer in RWKV6 whose name ends with
"gate" is not for MoE at all.

* Auto updated submodule references

* [LLaVa] Fix allowed text model value in config (#2062)

* Llava support vicuna and mistral text models

* Support f32 quantization

* Lint fix

* Use preset if transformers not installed

* Rebase on main

---------

Co-authored-by: Animesh Bohara <[email protected]>

* Auto updated submodule references

* Revert "Allow "mlc_llm --host" option to override host triple the model compi…" (#2115)

This reverts commit 12ca8fdbe2a24f43bbc72241a76735dbad8c2026.

Co-authored-by: Mengshiun Yu <[email protected]>

* Revert "Auto updated submodule references" (#2117)

This reverts commit c4169d8c8a4afedd06bc9d9b99c3aa65eee4a89e
which causes CI broken.

* [Metadata] Include picojson rather than forward declaring (#2118)

This PR fixes the picojson uses in MLC that conflicts with the latest
changes on the picojson side.

* Auto updated submodule references

* Auto updated submodule references

* [Serving][Grammar] Porting the json schema converter from python to C++ (#2112)

[Serve][Grammar] Porting the json schema converter from python to C++

This PR ports the json schema converter from python to C++. It defines
the interface:
```
std::string JSONSchemaToEBNF(
    std::string schema, std::optional<int> indent = std::nullopt,
    std::optional<std::pair<std::string, std::string>> separators = std::nullopt,
    bool strict_mode = true);
```

And uses it in BNFGrammar::FromSchema.

This helps cases where python cannot be deployed.

* [Model] Use R.topk/cumsum for mixtral (#2107)

* Enable flashinfer when group_size == 6 (#2124)

* [SpecDecode] Support Eagle in speculative decoding (#2080)

1. Add Eagle-Llama-7b-chat model support.
2. Add speculative decoding support with Eagle.

* [Pass] Attach non-negative TIR var attributes (#2125)

This PR attaches the attributes of `tir.non_negative_var` for memory
planning.

* [Serving][Refactor] Engine constructor interface refactor (#2126)

This PR is a refactor of the engine's constructor interface
and the serve CLI interface.

This PR introduces the "mode" argument for the engine, which has options
"local", "interactive" and "server". The choice of mode affects
the automatically inferred values of `max_batch_size`,
`max_total_sequence_length` and `prefill_chunk_size` (only effective
when the arguments are not specified; once an argument is specified,
we will not override it). For the detailed specification of each mode,
please check out the CLI help messages in `mlc_llm/help.py` or the
engine constructor in `mlc_llm/serve/engine.py`.

No matter which mode is chosen, we will print out the current mode
and the values of these arguments, for people to understand the
settings of the engine. We also provide hints on how to adjust the
mode. For example,

```
[2024-04-12 16:12:26] INFO chat_module.py:379: Using model folder: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16-MLC
[2024-04-12 16:12:26] INFO chat_module.py:380: Using mlc chat config: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16-MLC/mlc-chat-config.json
[2024-04-12 16:12:26] INFO chat_module.py:529: Using library model: dist/Llama-2-7b-chat-hf-q0f16-MLC/Llama-2-7b-chat-hf-q0f16-MLC-cuda.so
[2024-04-12 16:12:26] INFO chat_module.py:379: Using model folder: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q4f16_1-MLC
[2024-04-12 16:12:26] INFO chat_module.py:380: Using mlc chat config: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json
[2024-04-12 16:12:26] INFO chat_module.py:529: Using library model: dist/Llama-2-7b-chat-hf-q4f16_1-MLC/Llama-2-7b-chat-hf-q4f16_1-MLC-cuda.so
[2024-04-12 16:12:29] INFO engine_base.py:382: Engine mode is "local". Max batch size is set to 4. Max KV cache token capacity is set to 4096. Prefill chunk size is set to 4096.
[2024-04-12 16:12:29] INFO engine_base.py:387: Estimated total single GPU memory usage: 21543.74 MB (Parameters: 16467.64 MB. KVCache: 4450.07 MB. Temporary buffer: 626.03 MB). The actual usage might be slightly larger than the estimated number.
[2024-04-12 16:12:29] INFO engine_base.py:398: Please switch to mode "server" if you want to use more GPU memory and support more concurrent requests.
```

After the refactor, we bring the speculative decoding to the serve
CLI so that people can use multiple models and run speculative
decoding with the server launched in CLI (which was not doable before).

* [Serving] Revamp engine mode selection logging info (#2128)

This PR revamps the logging info for engine mode selection to provide
more detailed information and the rationale of different modes.

* [SLM] Chatglm3 Multi-GPU support (#2123)

This PR enables TP for Chatglm3 model.

* [Serving] Fix support of large `n` under low max batch size (#2136)

Prior to this PR, due to the improper prefill policy on `n` (parallel
generation), the engine would loop forever when a request has `n`
larger than the maximum batch size that the engine can support.

This PR fixes this issue by updating the prefill action, and with this
PR, even the "interactive" engine mode can well support multiple
parallel generation.

After this fix, it is possible that a request requires 10 parallel
generations while the max batch size is 1. Given that the shapes of
temporary NDArrays in the GPU sampler are determined by the max batch
size, the GPU sampler does not natively support sampling 10 tokens at
a time. To approach
this issue, this PR introduces chunking to GPU sampler. Therefore,
in this particular case, the GPU sampler will have chunk size 1,
and the 10 required samples will be processed by the GPU sampler
one by one in order. Chunking is the minimum change we can do to support
large `n`.

* [Docs] Revamp landing page with Engine Python API and server (#2137)

This PR revamps the landing documentation page.

* The Python API panel is changed from showing ChatModule to showing
Engine.
* A new panel "REST Server" is added to show a quick start example
of launching REST server and send request.
* A "what to do next" section is introduced at the bottom of the
landing page.

Todo items for future PR:

* add the page of Python API with Engine.
* revamp weight conversion page.
* revamp model library compilation page.

* [Target] Update Target tags (#2141)

The commit updates the target tags, in order to identify the different
SoC hardware targets for further target-specific optimizations.

Meanwhile, update the vulkan support for int64.

* [Util] Support debug debug_compare (#2142)

* [Minor][SpecInfer] Fix Optional FC Bias for Mixtral Eagle Model (#2146)

* Add optional fc bias for mixtral.

* Fix lint.

* [Serving] fix hardcoded host and port in popen_server (#2147)

* [Docs] Introductory tutorial (#2145)

This PR updates the documentation with an introduction tutorial.
The landing page now directs to the quick start page and the tutorial.

* [Serving] Support `DebugCallFuncOnAllAllWorker` and CUDA profiler (#2148)

This PR adds a new function `DebugCallFuncOnAllAllWorker` which calls
a global function of signature `[] -> None` on all distributed workers
when tensor parallelism is enabled (or the local session itself if not
enabled).

As the name suggests, this function is only for debugging purposes, and
we will not expose any public interface to invoke this function.

This PR also introduces the global functions
`"mlc.debug_cuda_profiler_start"` and `"mlc.debug_cuda_profiler_stop"`,
which enables CUDA profiling when using PopenServer.

* [DOCS] Update introduction (#2151)

* [DOCS] Update introduction

Some minor tweaks on the introduction doc

* Update docs/get_started/introduction.rst

Co-authored-by: Ruihang Lai <[email protected]>

---------

Co-authored-by: Ruihang Lai <[email protected]>

* [Serving][Python] Rename Engine to LLMEngine (#2152)

We rename the public Python serve interface from `Engine` to
`LLMEngine` (and from `AsyncEngine` to `AsyncLLMEngine` accordingly)
for better class name clarity.

This is because in cases people do wildcard import, in which case
the name `Engine` itself does not convey enough meaning.

* Auto updated submodule references

* [Quantization] Add e4m3 mode and enable fp8 storage type (#2154)

* [Quantization] Add e4m3 mode and enable fp8 storage type

* add quantize linear flag

* Revert "[Quantization] Add e4m3 mode and enable fp8 storage type" (#2158)

Revert "[Quantization] Add e4m3 mode and enable fp8 storage type (#2154)"

This reverts commit e9a4a0bf719a7c4fd42b438cf9e159a1e8d72590.

* [Serving] EngineConfig refactor (#2159)

This PR refactors EngineConfig for a cleaner interface of internal
Engine constructor in MLC serve. This is a preparation step towards
the engine reload/unload which will be introduced in follow-up PRs
for JSONFFIEngine functionality on mobile and other platforms.

* [Llama3] Support Llama 3 (#2163)

* Add conv template and model preset

* Fix conv template

* Trivial

* [Fix] Fix llama 3 conv template (#2164)

Fix llama 3 conv template

* Auto updated submodule references

* [Serving][HotFix] No `std::move()` for disco CallPacked (#2166)

The disco `CallPacked` function cannot handle `std::move()` very
well. A previous engine refactor PR introduced a regression that broke
our tensor parallelism support. This commit fixes the issue.

* [Docs] Update example for Llama3 (#2169)

This PR updates the huggingface repo examples to use Llama3.

* [README] Fix broken link to Python API (#2168)

* [Docs] Update README (#2170)

This PR updates README for Llama3 quick start examples.

* [Docs] Documentation of LLMEngine in Python API (#2172)

This PR completes the documentation page of LLMEngine and
AsyncLLMEngine in our Python API.

* [Docs] Update project website (#2175)

This PR mainly updates the project website, and also updates some
minor points for other docs.

* [Docs][Fix] Update index.md for jekyll failure (#2176)

This PR fixes the jekyll failure of the project website by removing the citation section (having it in README is sufficient).

* [Quantization] Add e4m3 mode and enable fp8 storage type (reland #2154) (#2161)

* [Quantization] Add e4m3 mode and enable fp8 storage type

* add quantize linear flag

* [Docs] Fix API reference not displayed (#2177)

This PR fixes the issue of the API reference not displayed in the documentation.

* [Docs] Update project website (#2180)

This PR updates the project landing website to remove some information.

* [Misc] Pass env along when calling `subprocess.run` (#2179)

The uses of `subprocess.run` in the codebase did not pass the
environment along, which may cause issues in some cases.

* Change OpenAI protocol default value to None and supply using model config (#2178)

* Change OpenAI protocol default value to None and supply using model config

* Fix lint

* [Serving][Spec] Fix the output inconsistent bug of q0f32 spec decoding (#2184)

- According to https://github.com/mlc-ai/mlc-llm/issues/2167, the problem
that the output of q0f32 spec decoding is inconsistent with the
single-model q0f32 output has been fixed.
- Modified the `test_engine_generate` function located in
`tests/python/serve/test_serve_engine_spec.py` to support comparing
the output of a single model with the output of spec decoding.
- The accuracy comparison with Hugging Face is left for later (because
the current q0f32 version of llama-2-7b cannot be made consistent
with the output of the Hugging Face model).
- The output of spec decoding for q0f16 cannot be made consistent
with the output of a single q0f16 model, but this may be due
to floating point errors.

Co-authored-by: DearFishi <[email protected]>

* [Serving] Support ThreadedEngine Reload/Unload/Reset (#2185)

This PR brings the support of reload (reload the engine with a new
model), unload (unload the current running model) and reset (reset
the engine to the initial states without unloading) to
ThreadedEngine and JSONFFIEngine.

These functions are useful for app bindings for iOS/Android.

* [WASM] Support grammar schema in wasm (#2187)

* [Serving] Support loading system library (#2189)

This PR introduces the support of loading system libraries. Now in
engine reload, when the given library path starts with `"system://"`,
we recognize this as a system library and will try to load the
library from the path after the `"system://"` prefix.

This PR also decouples the InitBackgroundEngine of ThreadedEngine
into two parts, where the reload is now called explicitly when
initializing the engine. This can also be done for the JSONFFIEngine;
however, we first need to move the construction of streamers in
JSONFFIEngine before doing the same there, so this is marked as
a TODO item.
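The `"system://"` dispatch described above amounts to a prefix check on the library path. A sketch of the logic (the helper name and return convention are illustrative, not the actual C++ code):

```python
def resolve_model_lib(path):
    """Dispatch a model library path: a path starting with "system://"
    refers to a system (statically registered) library; anything else
    is treated as a file path."""
    prefix = "system://"
    if path.startswith(prefix):
        return ("system", path[len(prefix):])
    return ("file", path)
```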

* [Op] Batch verify for speculative decoding (#2186)

This PR adds batch verify for spec decode
----
Co-authored-by: Wuwei Lin <[email protected]>

* [JIT] Better organize JIT and AOT handling (#2191)

* [JIT] Better organize JIT and AOT handling

Previously we did JIT when the AOT lib lookup failed.
The error message could become cryptic when JIT also failed:
it showed up as "cannot find None-vulkan.dll".

This PR changes the behavior to only do lookup when model_lib_path
is provided, and only JIT when it is not. This leads to a
cleaner error message overall.

* Windows compact

* More windows instructions

* Fix prefill and context flag names in doc (#2192)

* Update compile_models.rst

Fix flag names for prefill chunk size and context window size.

* Update compile_models.rst

* [Docs] Update quick start to mention Llama 3 8B (#2196)

This commit updates the quick start to mention Llama 3 8B instead of Llama 2 7B. The code blocks were already updated.

* [SERVING] Add Conv Template and Function Calling support to JSON FFI (#2190)

This PR adds conv template support to the JSON FFI Engine.
Also add function calling and pass stop str to generation config.

Co-authored-by: Shrey Gupta <[email protected]>

* [Serving] Paged Radix Tree for Prefix Caching (#2183)

This PR introduces the Paged Radix Tree data structure, as the foundation and a prerequisite of prefix caching.
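The role a radix tree plays in prefix caching can be sketched with a plain per-token trie: insert served token sequences, then find the longest cached prefix of a new request. This is a simplified illustration; the actual Paged Radix Tree compresses runs of tokens into pages:

```python
class RadixNode:
    """Simplified per-token trie node (the real structure is paged)."""
    def __init__(self):
        self.children = {}

class PrefixTree:
    """Minimal sketch of prefix matching behind prefix caching."""
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())

    def longest_prefix(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node, length = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            length += 1
        return length
```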

* [Serving] Remove mandatory model check in server (#2195)

This PR removes the mandatory model check in the server, since as of
now we serve at most one engine, which means there is always a unique
engine being served. As issue #2155 points out, the model check
in the server can be a bad experience when the model string mismatches.

* [Sampler] Enable GPU sampler for draft verification (#2198)

* [Eagle] Attach gpu verifier to model

* WIP

* WIP

* fix

* Enable GPU verifier

* lint

* lint

* [Eagle] Make eagle disco compatible (#2197)

* [Eagle] Make BatchSelectLastHidden able to run on the controller

* [Serving][Spec] Fix normal mode verification for extra draft token (#2206)

This PR updates the draft verification of the normal mode speculative
decoding. Prior to this PR, we did not effectively leverage all the
draft tokens, and this PR fixes the issue.
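For context, the standard speculative-decoding verification rule accepts draft token t with probability min(1, p_target(t) / p_draft(t)) and stops at the first rejection. A sketch of the general technique (illustrative, not MLC's exact kernel):

```python
import random

def verify_draft(draft_tokens, draft_probs, target_probs, rng=random.random):
    """Return the number of accepted draft tokens. draft_probs and
    target_probs are per-position probability vectors over the vocab."""
    accepted = 0
    for t, p_d, p_t in zip(draft_tokens, draft_probs, target_probs):
        ratio = p_t[t] / max(p_d[t], 1e-12)  # guard against zero draft prob
        if rng() < min(1.0, ratio):
            accepted += 1
        else:
            break  # first rejection ends verification
    return accepted
```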

* [Sampler] Prob renormalization with top p for spec decoding (#2201)

This PR introduces a renormalization interface with regard to top-p
values for speculative decoding. This is helpful for simplifying the
logic of speculative decoding verification stage, as all probs have
been already updated with the top-p values and no top-p needs to
be taken into consideration.

So for speculative decoding, we always renorm the probability
distribution before sampling/verifying. For non speculative decoding
mode, we keep using the previous flow, which applies top-p together
when sampling.

Co-authored-by: Wuwei Lin <[email protected]>
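Top-p renormalization keeps the smallest set of highest-probability tokens whose cumulative mass reaches top_p, zeroes out the rest, and renormalizes. A minimal sketch on a plain probability list (illustrative; the real implementation runs on GPU):

```python
def renorm_top_p(probs, top_p):
    """Renormalize a probability vector with regard to top_p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break  # smallest set reaching the top_p mass
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass
    return out
```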

* [Python] Rename LLMEngine to MLCEngine (#2210)

This commit renames the LLMEngine to MLCEngine.

* [Fix] CUDA architecture detection bug fix (#2211)

This commit returns a list of integers and adds an assert checking that the CUDA architecture string contains numbers only.

Co-authored-by: msyu <[email protected]>

* [Android ] Enable OpenCL host pointer usage (#2215)

Take advantage of the OpenCL host pointer, which improves copy performance.

* [PYTHON][KVCACHE] Enhance the thread limit for opencl (#2216)

It improves TIR-based paged attention by 2x on OpenCL Adreno.

* [Serving] Support RWKV for serving  (#2111)

feat: support serving for rwkv

* [Serving] Remove `cli.model_metadata` import from engine base (#2226)

This PR removes the imports of functions in `cli.model_metadata` from
engine_base.py. The file `cli.model_metadata` is not designed to be
imported directly, and importing functions from it repetitively
reports warnings of

```
RuntimeWarning: 'mlc_llm.cli.model_metadata' found in sys.modules after
import of package 'mlc_llm.cli', but prior to execution of
'mlc_llm.cli.model_metadata'; this may result in unpredictable behaviour
```

* [JSONFFIEngine] Support generation config in JSONFFIEngine. Default config values to NOT_GIVEN (#2225)

* Change OpenAI protocol default value to None in JSON FFI engine

* [JSONFFIEngine] Support generation config in JSONFFIEngine. Default config values to NOT_GIVEN

* [Sampler] Fix GPU sampler behavior when batch size is 0 (#2234)

This PR adds the early exit for the GPU sampler, which ran into
GPU kernels even when the batch size is 0 prior to this commit.

The 0 batch size case can happen when parallel generation of a request
and engine preemption exists. In this case, the GPU sampler s…