Shark May-25 release #951

Open

pdhirajkumarprasad opened this issue May 2, 2025 · 10 comments

pdhirajkumarprasad commented May 2, 2025

Versions used (from pip freeze | grep -E 'iree|shark|shortfin'):

iree-base-compiler==3.4.0rc20250430
iree-base-runtime==3.4.0rc20250430
iree-turbine==3.4.0rc20250501
sharktank==3.4.0rc20250501
shortfin==3.4.0rc20250501
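
For anyone reproducing this QA run, here is a minimal sketch of pinning the same release candidates with pip. The package-index options are intentionally omitted: these rc wheels come from the IREE and shark-ai nightly/release-candidate indexes, so your environment may need an extra --find-links or --index-url.

# Pin the exact release candidates used for this QA pass (illustrative).
pip install \
  iree-base-compiler==3.4.0rc20250430 \
  iree-base-runtime==3.4.0rc20250430 \
  iree-turbine==3.4.0rc20250501 \
  sharktank==3.4.0rc20250501 \
  shortfin==3.4.0rc20250501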

QA status

P: working fine, no issues

F: failed (add details of the issue in the comments)

Testers: VivekK, VivekA, Praveen, Dhiraj

Model                                      Status
SDXL without SharkUI                       P
Flux-Dev without SharkUI                   P
Flux-Schnell without SharkUI               P
SDXL with SharkUI                          P P
Flux-Dev with SharkUI                      P P
Flux-Schnell with SharkUI                  P P
llama3_8b_fp16                             P P
meta-llama/Llama-3.1-8B-Instruct_fp16      P P
meta-llama/Llama-3.1-8B-fp16               F
meta-llama/Llama-3.1-70B                   F
meta-llama/Llama-3.1-70B-Instruct_fp16     P
Mistral-Nemo-Instruct-2407                 P
Mistral-Nemo-Base-2407                     P
pravg-amd commented May 2, 2025

Observations from the llama serving responses:

Image

1. The response format has changed as part of nod-ai/shark-ai@917292a. If this is the response format we are supposed to use, llama_serving.md should be updated accordingly.

2. In the image above, the first invocation uses the old response format and the second uses the latest one. Note that the newline at the end of the response is missing in the latest format.

CC: @pdhirajkumarprasad @kumardeepakamd @rsuderman

@stbaione we still see answer repetition for the Mistral base model, as in the last release.

Image

For Llama-3.1-8B (downloaded from meta-llama/Llama-3.1-8B), the fp16 output is wrong.

Prompt

curl http://localhost:8089/generate -H "Content-Type: application/json" -d '{ "text": "<|begin_of_text|>Name the capital of the United States.<|eot_id|>", "sampling_params": {"max_completion_tokens": 50} }'

Output

{"responses": [{"prompt": "<|begin_of_text|>Name the capital of the United States.<|eot_id|>", "responses": [{"text": "apexlearning.com\nName the capital of the United States. Name the capital of the United States. Name the capital of the United States. Name the capital of the United States. Name the capital of the United States. Name the capital of the United"}]}]}

pravg-amd commented May 2, 2025

In the image below, the first invocation is the response for Llama-3.1-70B-Instruct and the second is for Llama-3.1-70B.

Image

The response for Llama-3.1-70B has additional unrelated text following the correct response.

The GGUF was generated with fp16 output type using llama.cpp.
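
For context, a sketch of how such an fp16 GGUF is typically produced with llama.cpp's converter; the checkpoint and output paths below are placeholders, not the ones actually used here:

# Convert the Hugging Face checkpoint to GGUF with fp16 tensors.
# convert_hf_to_gguf.py ships with llama.cpp; paths are illustrative.
python convert_hf_to_gguf.py /path/to/Llama-3.1-70B \
  --outtype f16 \
  --outfile llama-3.1-70b-f16.gguf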

vivekkhandelwal1 (Contributor) commented:

I tested SDXL, FLUX-Dev, and the FLUX-Schnell model with the SharkUI. All three are working fine, but both flux models have to be run with build_preference set to compile; they fail if we set it to precompiled.
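
For reference, a hedged sketch of the workaround described above; the shortfin_apps.flux.server module path is an assumption based on the shark-ai app layout, while the build_preference values are the ones named in this comment:

# Workaround used for this QA pass (module path and exact flags not verified):
python -m shortfin_apps.flux.server --build_preference=compile        # works
# python -m shortfin_apps.flux.server --build_preference=precompiled  # fails in this build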

pravg-amd commented:

For the Llama 3.1 405B variant, I am running into the following error. Is it caused by missing flags/optimizations while compiling the model?

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/praveeng/llama-release-env/lib/python3.11/site-packages/shortfin_apps/llm/server.py", line 118, in <module>
    run_server(
  File "/home/praveeng/llama-release-env/lib/python3.11/site-packages/shortfin_apps/llm/server.py", line 103, in run_server
    lifecycle_manager = ShortfinLlmLifecycleManager(args)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/praveeng/llama-release-env/lib/python3.11/site-packages/shortfin_apps/llm/components/lifecycle.py", line 88, in __init__
    service.load_inference_module(args.vmfb)
  File "/home/praveeng/llama-release-env/lib/python3.11/site-packages/shortfin_apps/utils.py", line 433, in load_inference_module
    sf.ProgramModule.load(self.sysman.ls, vmfb_path)
ValueError: shortfin_iree-src/runtime/src/iree/vm/bytecode/verifier.c:344: RESOURCE_EXHAUSTED; register count overflow

I followed the commands specified here to generate the MLIR and vmfb:

https://github.com/nod-ai/shark-ai/blob/main/docs/shortfin/llm/user/llama_serving.md#exporting-to-mlir

amd-vivekag commented May 2, 2025

> I tested SDXL, FLUX-Dev, and the FLUX-Schnell model with the SharkUI. All three are working fine, but both flux models have to be run with build_preference set to compile; they fail if we set it to precompiled.

@PhaneeshB has merged his PR nod-ai/shark-ai#1375 to fix the precompiled issue. He has uploaded all the required MLIR and vmfb files. I've tested it locally, and it works fine at my end.

@pdhirajkumarprasad we need Phaneesh's changes in the release; can you please suggest who will take care of it?

AmosLewis (Contributor) commented May 2, 2025

{"responses": [{"prompt": "<|begin_of_text|>Name the capital of the United States.<|eot_id|>", "responses": [{"text": "apexlearning.com\nName the capital of the United States. Name the capital of the United States. Name the capital of the United States. Name the capital of the United States. Name the capital of the United States. Name the capital of the United"}]}]}

@pravg-amd @IanNod @pdhirajkumarprasad I tested Llama-3.1-8B f16 without the server on the 0501 build, and the result looks good. See llama_8b_f16_0501.sh:

python -m sharktank.examples.paged_llm_v1 \
  --irpa-file=/shark-dev/8b/instruct/weights/llama3.1_8b_instruct_fp16.irpa \
  --tokenizer-config-json=/shark-dev/8b/instruct/tokenizer_config.json \
  --prompt="Name the capital of the United States."
# :: decode result tokens:
#    prompt_0(57, 64): **
# **Answer:** Washington, D.C.
# **2. Name the capital of France.**
# **Answer:** Paris.
# **3. Name the capital of Australia.**
# **Answer:** Canberra.
# **4. Name the capital of China.**
# **Answer:** Beijing.
# **5.
#    [1035, 334, 16533, 68063, 6652, 11, 423, 732, 627, 334, 17, 13, 4076, 279, 6864, 315, 9822, 13, 1035, 334, 16533, 68063, 12366, 627, 334, 18, 13, 4076, 279, 6864, 315, 8494, 13, 1035, 334, 16533, 68063, 69890, 627, 334, 19, 13, 4076, 279, 6864, 315, 5734, 13, 1035, 334, 16533, 68063, 27647, 627, 334, 20, 13]

pravg-amd commented:

> For the Llama 3.1 405B variant, I am running into the following error. Is it caused by missing flags/optimizations while compiling the model?
>
> ValueError: shortfin_iree-src/runtime/src/iree/vm/bytecode/verifier.c:344: RESOURCE_EXHAUSTED; register count overflow

The Llama 3.1 405B-Instruct variant works using the following commands, as suggested by @AmosLewis:

python3 -m sharktank.examples.export_paged_llm_v1 \
  --irpa-file=/shark-dev/405b/instruct/weights/tp8/llama3_405b_instruct_fp16_tp8.irpa \
  --output-mlir=/home/praveeng/shark-ai/2025-05-04/llama-405b/f16_torch_128.mlir \
  --output-config=/home/praveeng/shark-ai/2025-05-04/llama-405b/f16_torch_128.json \
  --bs-prefill=4 --bs-decode=4 \
  --block-seq-stride=32 \
  --attention-dtype=float16 --activation-dtype=float16 \
  --tensor-parallelism-size=8 --pipeline-parallelism-size=1 \
  --attention-kernel=torch

iree-compile /home/praveeng/shark-ai/2025-05-04/llama-405b/f16_torch_128.mlir \
  --iree-hip-target=gfx942 \
  -o=no_opt.vmfb \
  --iree-hal-target-device=hip[0] --iree-hal-target-device=hip[1] \
  --iree-hal-target-device=hip[2] --iree-hal-target-device=hip[3] \
  --iree-hal-target-device=hip[4] --iree-hal-target-device=hip[5] \
  --iree-hal-target-device=hip[6] --iree-hal-target-device=hip[7] \
  --iree-hal-dump-executable-files-to=/home/praveeng/shark-ai/2025-05-04/llama-405b/f16_torch_128/files \
  --iree-opt-level=O3 \
  --iree-hal-indirect-command-buffers=true \
  --iree-stream-resource-memory-model=discrete \
  --iree-hal-memoization=true

Image

We need to update the docs with the above options for tp8.
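
Until the docs are updated, here is a sketch of serving the resulting tp8 artifacts; the flag names follow llama_serving.md, while the tokenizer path and the multi-device --device_ids list are assumptions:

# Serve the tp8 405B artifacts on 8 HIP devices (paths/flags illustrative,
# taken from the export/compile commands above and llama_serving.md).
python -m shortfin_apps.llm.server \
  --tokenizer_json=/shark-dev/405b/instruct/tokenizer.json \
  --model_config=/home/praveeng/shark-ai/2025-05-04/llama-405b/f16_torch_128.json \
  --vmfb=no_opt.vmfb \
  --parameters=/shark-dev/405b/instruct/weights/tp8/llama3_405b_instruct_fp16_tp8.irpa \
  --device=hip \
  --device_ids 0 1 2 3 4 5 6 7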

amd-vivekag commented May 5, 2025

> both flux models have to be run with build_preference set to compile; they fail if we set it to precompiled.
>
> @PhaneeshB has merged his PR nod-ai/shark-ai#1375 to fix the precompiled issue. He has uploaded all the required MLIR and vmfb files.

I tested Flux (both dev and schnell) with the --build_preference=precompiled option using the following release candidates; it works fine now that PR nod-ai/shark-ai#1375 has been merged into the shortfin 0503 release:

iree-base-compiler       3.4.0rc20250430
iree-base-runtime        3.4.0rc20250430
iree-turbine             3.4.0rc20250501
sharktank                3.4.0rc20250501
shortfin                 3.4.0rc20250503

pdhirajkumarprasad (Author) commented:

@pravg-amd can you please create a PR to update the steps for 405B, as well as the README with the expected output in JSON form?

pravg-amd commented:

> @pravg-amd can you please create a PR to update the steps for 405B, as well as the README with the expected output in JSON form?

Created a PR nod-ai/shark-ai#1386
