Add export --output-snapshot-path snap.tc, and --snapshot-path snap.tc #1465

Open: mikekgfb wants to merge 8 commits into main

Conversation

mikekgfb (Contributor) commented:

Add ability to save and restore quantized models #1032

mgschwind@mgschwind-mlt torchchat % python3 torchchat.py generate stories15M --quant torchchat/quant_config/desktop.json --prompt "once upon a time"
NumExpr defaulting to 12 threads.
PyTorch version 2.6.0.dev20241218 available.
Unabled to import torchao experimental quant_api with error:  [Errno 2] No such file or directory: '/Users/mgschwind/tc/torchchat/torchao-build/src/ao/torchao/experimental/quant_api.py'
Using device=mps 
Loading model...
Time to load model: 0.18 seconds
Quantizing the model with: {'executor': {'accelerator': 'fast'}, 'precision': {'dtype': 'fast16'}}
Time to quantize model: 0.00 seconds
-----------------------------------------------------------
once upon a time, there was a little girl named Lily. She loved to play outside in the park with her friends. One day, Lily saw a big, scary dog. She was frightened and didn't know what to do. 
Lily's friend, Timmy, saw her and asked, "What's wrong, Lily?" 
"I'm frightened of the dog," Lily said. 
Timmy said, "Don't worry, I'll call my mom and she will come to save us." 
After Timmy called, he ran to Lily's mom and told her about the scary dog. Lily's mom called the street workers to come and take the dog away. 
Lily was happy and said, "Thank you, Timmy and Timmy's mom." Once upon a time, there was a little girl named Lily. She loved to play outside in
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                
Generated 199 tokens                 
Time for inference 1: 2.0419 sec total                 
Time to first token: 0.1196 sec with parallel prefill.                

      Total throughput: 97.9489 tokens/sec, 0.0102 s/token                 
First token throughput: 8.3582 tokens/sec, 0.1196 s/token                 
 Next token throughput: 103.5252 tokens/sec, 0.0097 s/token                     

Bandwidth achieved: 4.78 GB/s
*** This first iteration will include cold start effects for dynamic import, hardware caches. ***

========================================


Warning: Excluding compile in calculations                 
      Average tokens/sec (total): 97.95                 
Average tokens/sec (first token): 8.36                 
Average tokens/sec (next tokens): 103.53 
                
mgschwind@mgschwind-mlt torchchat % python3 torchchat.py export stories15M --quant torchchat/quant_config/desktop.json --output-snap stories15-quant.tc 
NumExpr defaulting to 12 threads.
PyTorch version 2.6.0.dev20241218 available.
Unabled to import torchao experimental quant_api with error:  [Errno 2] No such file or directory: '/Users/mgschwind/tc/torchchat/torchao-build/src/ao/torchao/experimental/quant_api.py'
Using device=mps
Loading model...
Time to load model: 0.30 seconds
Quantizing the model with: {'executor': {'accelerator': 'fast'}, 'precision': {'dtype': 'fast16'}}
Time to quantize model: 0.00 seconds
-----------------------------------------------------------
Exporting model using Snapshot to /Users/mgschwind/tc/torchchat/stories15-quant.tc
 
mgschwind@mgschwind-mlt torchchat % python3 torchchat.py generate stories15M --quant torchchat/quant_config/desktop.json --prompt "once upon a time" --snap stories15-quant.tc

NumExpr defaulting to 12 threads.
PyTorch version 2.6.0.dev20241218 available.
Unabled to import torchao experimental quant_api with error:  [Errno 2] No such file or directory: '/Users/mgschwind/tc/torchchat/torchao-build/src/ao/torchao/experimental/quant_api.py'
Using device=mps 
Loading model...
Time to load model: 0.42 seconds
-----------------------------------------------------------
once upon a time, there was a boy called Jack. He was three years old and very excited. One day, when he was playing in the park, a nosy person came up to him and started pointing at something.
Jack was very curious and he stopped to stare. He saw a big pond, and he wondered what it was. Suddenly, a frog jumped out of the pond and winked at Jack. He was very surprised and the two of them started talking.
“Hey buddy, why are you in the pond? I won't hurt you,” said the frog.
Jack smiled at the frog and asked why he was there. The frog explained that he was looking for some food to feed himself. Jack told the frog that if he helped the frog, he could become friends with him and they could play together in the park.
The frog agreed and Jack carried him out of the p
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                
Generated 199 tokens                 
Time for inference 1: 1.9778 sec total                 
Time to first token: 0.1744 sec with parallel prefill.                

      Total throughput: 101.1238 tokens/sec, 0.0099 s/token                 
First token throughput: 5.7332 tokens/sec, 0.1744 s/token                 
 Next token throughput: 110.3501 tokens/sec, 0.0091 s/token                     

Bandwidth achieved: 17.49 GB/s
*** This first iteration will include cold start effects for dynamic import, hardware caches. ***

========================================


Warning: Excluding compile in calculations                 
      Average tokens/sec (total): 101.12                 
Average tokens/sec (first token): 5.73                 
Average tokens/sec (next tokens): 110.35 
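
In effect, --output-snapshot-path serializes the quantized eager model at export time and --snapshot-path reloads it at generate time, so quantization is performed only once; note that the third run above goes straight from "Loading model" to generation with no "Quantizing the model" line. A minimal sketch of what such a round trip could look like, assuming a plain torch.save / torch.load snapshot (the helper names are illustrative, not the actual torchchat implementation):

```python
import torch

def save_quantized_snapshot(model: torch.nn.Module, path: str) -> None:
    # Hypothetical helper: persist the already-quantized eager module so a
    # later generate run can skip the quantization step entirely.
    torch.save(model, path)

def load_quantized_snapshot(path: str, device: str = "mps") -> torch.nn.Module:
    # Hypothetical helper: restore the snapshot. Any custom quantized ops
    # referenced by the pickled module must already be imported at this point
    # (see the note on wholesale imports later in the conversation).
    return torch.load(path, map_location=device, weights_only=False)
```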
                


pytorch-bot (bot) commented on Jan 18, 2025:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1465

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Jan 18, 2025
mikekgfb changed the title from "Add export --output-snapshot snap.tc, and --snapshot snap.tc" to "Add export --output-snapshot-path snap.tc, and --snapshot-path snap.tc" on Jan 18, 2025
mikekgfb (Contributor, Author) commented:

The snapshot load path may need some Python imports to pull in all the quantization custom ops and custom kernels that quantize.py can make available, so that they are present when a model snapshot gets reloaded. The best approach may be to import all of them wholesale, rather than saving additional info in the snapshot and doing a selective import, because there just aren't enough custom ops to make a wholesale import prohibitive when reloading a snapshot.
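
A rough sketch of that wholesale-import approach (the module list below is an assumption for illustration, not the set torchchat actually registers):

```python
import importlib
import logging

# Assumed set of modules that register quantization custom ops / kernels;
# the real list would mirror whatever quantize.py can make available.
_QUANT_OP_MODULES = [
    "torchao.quantization",
    "torchao.experimental.quant_api",
]

def import_quant_custom_ops() -> None:
    # Import every known quantization backend up front, before unpickling a
    # snapshot, and quietly skip any that are not installed.
    for name in _QUANT_OP_MODULES:
        try:
            importlib.import_module(name)
        except ImportError as exc:
            logging.debug("Optional quant module %s not available: %s", name, exc)
```

Since the cost of a handful of extra imports is negligible, this avoids having to record inside the snapshot which quantizers were used.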
