Shark May-25 release #951

Open

pdhirajkumarprasad opened this issue May 2, 2025 · 10 comments

pdhirajkumarprasad commented May 2, 2025

Versions used (from pip freeze | grep -E 'iree|shark|shortfin'):

iree-base-compiler==3.4.0rc20250430
iree-base-runtime==3.4.0rc20250430
iree-turbine==3.4.0rc20250501
sharktank==3.4.0rc20250501
shortfin==3.4.0rc20250501
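
For anyone reproducing this QA run, here is a minimal sketch of pinning the same release candidates with pip. The package-index options are intentionally omitted: these rc wheels come from the IREE and shark-ai nightly/release-candidate indexes, so your environment may need an extra --find-links or --index-url.

# Pin the exact release candidates used for this QA pass (illustrative).
pip install \
  iree-base-compiler==3.4.0rc20250430 \
  iree-base-runtime==3.4.0rc20250430 \
  iree-turbine==3.4.0rc20250501 \
  sharktank==3.4.0rc20250501 \
  shortfin==3.4.0rc20250501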

QA status

P: working fine, no issues

F: failed (add details of the issue in the comments)

Testers: VivekK, VivekA, Praveen, Dhiraj

Model                                      Status
SDXL without SharkUI                       P
Flux-Dev without SharkUI                   P
Flux-Schnell without SharkUI               P
SDXL with SharkUI                          P P
Flux-Dev with SharkUI                      P P
Flux-Schnell with SharkUI                  P P
llama3_8b_fp16                             P P
meta-llama/Llama-3.1-8B-Instruct_fp16      P P
meta-llama/Llama-3.1-8B-fp16               F
meta-llama/Llama-3.1-70B                   F
meta-llama/Llama-3.1-70B-Instruct_fp16     P
Mistral-Nemo-Instruct-2407                 P
Mistral-Nemo-Base-2407                     P
pravg-amd commented May 2, 2025

Observations from the llama serving responses:

Image

1. The response format has changed as part of nod-ai/shark-ai@917292a. If this is the response format we are supposed to use, llama_serving.md should be updated accordingly.

2. In the image above, the first invocation uses the old response format and the second uses the latest one. Note that the newline at the end of the response is missing in the latest format.

CC: @pdhirajkumarprasad @kumardeepakamd @rsuderman

@stbaione we still see answer repetition for the Mistral base model, as in the last release.

Image

For Llama-3.1-8B (downloaded from meta-llama/Llama-3.1-8B), the fp16 output is wrong.

Prompt

curl http://localhost:8089/generate -H "Content-Type: application/json" -d '{ "text": "<|begin_of_text|>Name the capital of the United States.<|eot_id|>", "sampling_params": {"max_completion_tokens": 50} }'

Output

{"responses": [{"prompt": "<|begin_of_text|>Name the capital of the United States.<|eot_id|>", "responses": [{"text": "apexlearning.com\nName the capital of the United States. Name the capital of the United States. Name the capital of the United States. Name the capital of the United States. Name the capital of the United States. Name the capital of the United"}]}]}

pravg-amd commented May 2, 2025

In the image below, the first invocation is the response for Llama-3.1-70B-Instruct and the second is for Llama-3.1-70B.

Image

The response for Llama-3.1-70B has additional unrelated text following the correct response.

The GGUF was generated with fp16 output type using llama.cpp.
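
For context, a sketch of how such an fp16 GGUF is typically produced with llama.cpp's converter; the checkpoint and output paths below are placeholders, not the ones actually used here:

# Convert the Hugging Face checkpoint to GGUF with fp16 tensors.
# convert_hf_to_gguf.py ships with llama.cpp; paths are illustrative.
python convert_hf_to_gguf.py /path/to/Llama-3.1-70B \
  --outtype f16 \
  --outfile llama-3.1-70b-f16.gguf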

vivekkhandelwal1 (Contributor) commented:

I tested SDXL, FLUX-Dev, and the FLUX-Schnell model with the SharkUI. All three are working fine, but both flux models have to be run with build_preference set to compile; they fail if we set it to precompiled.
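
For reference, a hedged sketch of the workaround described above; the shortfin_apps.flux.server module path is an assumption based on the shark-ai app layout, while the build_preference values are the ones named in this comment:

# Workaround used for this QA pass (module path and exact flags not verified):
python -m shortfin_apps.flux.server --build_preference=compile        # works
# python -m shortfin_apps.flux.server --build_preference=precompiled  # fails in this build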

pravg-amd commented:

For the Llama 3.1 405B variant, I am running into the following error. Is it caused by missing flags/optimizations while compiling the model?

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/praveeng/llama-release-env/lib/python3.11/site-packages/shortfin_apps/llm/server.py", line 118, in <module>
    run_server(
  File "/home/praveeng/llama-release-env/lib/python3.11/site-packages/shortfin_apps/llm/server.py", line 103, in run_server
    lifecycle_manager = ShortfinLlmLifecycleManager(args)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/praveeng/llama-release-env/lib/python3.11/site-packages/shortfin_apps/llm/components/lifecycle.py", line 88, in __init__
    service.load_inference_module(args.vmfb)
  File "/home/praveeng/llama-release-env/lib/python3.11/site-packages/shortfin_apps/utils.py", line 433, in load_inference_module
    sf.ProgramModule.load(self.sysman.ls, vmfb_path)
ValueError: shortfin_iree-src/runtime/src/iree/vm/bytecode/verifier.c:344: RESOURCE_EXHAUSTED; register count overflow

I followed the commands specified here to generate the MLIR and vmfb:

https://github.com/nod-ai/shark-ai/blob/main/docs/shortfin/llm/user/llama_serving.md#exporting-to-mlir

amd-vivekag commented May 2, 2025

> I tested SDXL, FLUX-Dev, and the FLUX-Schnell model with the SharkUI. All three are working fine, but both flux models have to be run with build_preference set to compile; they fail if we set it to precompiled.

@PhaneeshB has merged his PR nod-ai/shark-ai#1375 to fix the precompiled issue. He has uploaded all the required MLIR and vmfb files. I've tested it locally, and it works fine at my end.

@pdhirajkumarprasad we need Phaneesh's changes in the release; can you please suggest who will take care of it?

AmosLewis (Contributor) commented May 2, 2025

{"responses": [{"prompt": "<|begin_of_text|>Name the capital of the United States.<|eot_id|>", "responses": [{"text": "apexlearning.com\nName the capital of the United States. Name the capital of the United States. Name the capital of the United States. Name the capital of the United States. Name the capital of the United States. Name the capital of the United"}]}]}

@pravg-amd @IanNod @pdhirajkumarprasad I tested Llama-3.1-8B f16 without the server on the 0501 build, and the result looks good. See llama_8b_f16_0501.sh:

python -m sharktank.examples.paged_llm_v1 \
  --irpa-file=/shark-dev/8b/instruct/weights/llama3.1_8b_instruct_fp16.irpa \
  --tokenizer-config-json=/shark-dev/8b/instruct/tokenizer_config.json \
  --prompt="Name the capital of the United States."
# :: decode result tokens:
#    prompt_0(57, 64): **
# **Answer:** Washington, D.C.
# **2. Name the capital of France.**
# **Answer:** Paris.
# **3. Name the capital of Australia.**
# **Answer:** Canberra.
# **4. Name the capital of China.**
# **Answer:** Beijing.
# **5.
#    [1035, 334, 16533, 68063, 6652, 11, 423, 732, 627, 334, 17, 13, 4076, 279, 6864, 315, 9822, 13, 1035, 334, 16533, 68063, 12366, 627, 334, 18, 13, 4076, 279, 6864, 315, 8494, 13, 1035, 334, 16533, 68063, 69890, 627, 334, 19, 13, 4076, 279, 6864, 315, 5734, 13, 1035, 334, 16533, 68063, 27647, 627, 334, 20, 13]

pravg-amd commented:

> For the Llama 3.1 405B variant, I am running into the following error. Is it caused by missing flags/optimizations while compiling the model?
>
> ValueError: shortfin_iree-src/runtime/src/iree/vm/bytecode/verifier.c:344: RESOURCE_EXHAUSTED; register count overflow

The Llama 3.1 405B-Instruct variant works using the following commands, as suggested by @AmosLewis:

python3 -m sharktank.examples.export_paged_llm_v1 \
  --irpa-file=/shark-dev/405b/instruct/weights/tp8/llama3_405b_instruct_fp16_tp8.irpa \
  --output-mlir=/home/praveeng/shark-ai/2025-05-04/llama-405b/f16_torch_128.mlir \
  --output-config=/home/praveeng/shark-ai/2025-05-04/llama-405b/f16_torch_128.json \
  --bs-prefill=4 --bs-decode=4 \
  --block-seq-stride=32 \
  --attention-dtype=float16 --activation-dtype=float16 \
  --tensor-parallelism-size=8 --pipeline-parallelism-size=1 \
  --attention-kernel=torch

iree-compile /home/praveeng/shark-ai/2025-05-04/llama-405b/f16_torch_128.mlir \
  --iree-hip-target=gfx942 \
  -o=no_opt.vmfb \
  --iree-hal-target-device=hip[0] --iree-hal-target-device=hip[1] \
  --iree-hal-target-device=hip[2] --iree-hal-target-device=hip[3] \
  --iree-hal-target-device=hip[4] --iree-hal-target-device=hip[5] \
  --iree-hal-target-device=hip[6] --iree-hal-target-device=hip[7] \
  --iree-hal-dump-executable-files-to=/home/praveeng/shark-ai/2025-05-04/llama-405b/f16_torch_128/files \
  --iree-opt-level=O3 \
  --iree-hal-indirect-command-buffers=true \
  --iree-stream-resource-memory-model=discrete \
  --iree-hal-memoization=true

Image

We need to update the docs with the above options for tp8.
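
Until the docs are updated, here is a sketch of serving the resulting tp8 artifacts; the flag names follow llama_serving.md, while the tokenizer path and the multi-device --device_ids list are assumptions:

# Serve the tp8 405B artifacts on 8 HIP devices (paths/flags illustrative,
# taken from the export/compile commands above and llama_serving.md).
python -m shortfin_apps.llm.server \
  --tokenizer_json=/shark-dev/405b/instruct/tokenizer.json \
  --model_config=/home/praveeng/shark-ai/2025-05-04/llama-405b/f16_torch_128.json \
  --vmfb=no_opt.vmfb \
  --parameters=/shark-dev/405b/instruct/weights/tp8/llama3_405b_instruct_fp16_tp8.irpa \
  --device=hip \
  --device_ids 0 1 2 3 4 5 6 7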

amd-vivekag commented May 5, 2025

> both flux models have to be run with build_preference set to compile; they fail if we set it to precompiled.
>
> @PhaneeshB has merged his PR nod-ai/shark-ai#1375 to fix the precompiled issue. He has uploaded all the required MLIR and vmfb files.

I tested Flux (both dev and schnell) with the --build_preference=precompiled option using the following release candidates; it works fine now that PR nod-ai/shark-ai#1375 has been merged into the shortfin 0503 release:

iree-base-compiler       3.4.0rc20250430
iree-base-runtime        3.4.0rc20250430
iree-turbine             3.4.0rc20250501
sharktank                3.4.0rc20250501
shortfin                 3.4.0rc20250503

pdhirajkumarprasad (Author) commented:

@pravg-amd can you please create a PR to update the steps for 405B, as well as the README with the expected output in JSON form?

pravg-amd commented:

> @pravg-amd can you please create a PR to update the steps for 405B, as well as the README with the expected output in JSON form?

Created a PR nod-ai/shark-ai#1386
