-
@iseeyuan Do you think it would be possible to get to a point where we can meet model definitions where they are, even if we're maybe getting 50%-80% of theoretical peak performance? The reason I ask is that there is a significant burden involved in writing the model the way that ET expects. If we could provide a "peak performance path" and an "out of box / just works" path, that would be very nice. My experience with some internal teams is that they try to run existing language models on ET and drop it when the out-of-the-box performance is far behind what they expect. Aside from that, I think standardizing the language model APIs will be a big win for usability. Thanks for putting this together.
-
meta comment: you need to format the RFC a bit. The indentation and spacing are off enough that it makes for a hard read.
-
@iseeyuan I feel that you should split this RFC into two: one for the model architecture definition and one for the export_llama_lib refactoring.
-
To what extent is this true? If someone doesn't have permission to rewrite the modeling code, does that mean the model won't work for that backend at all, or will it still work but just not achieve the best performance? Maybe @kimishpatel @digantdesai @cccclai can comment on this?
I agree that we should provide tools to make source code rewriting easier. However, I don't think rewriting should always be done in the original modeling code, as this could impact dozens of models and make OSS contributions increasingly difficult as more models are covered. (This is how I interpret the proposal, given its goal of unifying code and reducing boilerplate.) For example, as a user, if the code I modify could affect the performance of numerous models and use cases, I'd be hesitant to make changes and would likely defer to ET developers instead. This not only places us at the center of enablement and improvement work but also risks making contributions more intimidating.
Many of the proposed ideas already exist in HF. For example, the ability to add and register different attention implementations is already supported (pointer). Additionally, the lifted cache is already exported as IO in Exported IR (example). My impression is that this proposal is leaning toward consolidating HF Transformers' definitions and owning them in our repo, aiming to support as many transformer models as possible, including text, audio, and vision transformers. Can this approach scale effectively? One of the core principles HF Transformers upholds is "single model, single file" (as mentioned at PTC 2024). I believe they are fully aware of the downside of this approach (namely, redundant code), but it provides significant flexibility in isolating performance impacts across models and reduces the complexity of performance testing. So far, this strategy has proven highly successful.
I want to second this. Some ML engineers who just want to prototype quickly in Python shouldn't need to be aware of the runtime code (C++). Take the HF workflow as an example: good UX means an ML engineer should be able to experiment with different models and recipes, validating end-to-end in Python, without needing any knowledge of the underlying runtime. This requires the interface to the runtime(s) to be not only backend-agnostic but also model-agnostic.
Back to the key problem highlighted in this proposal: having multiple modeling sources in our repo is indeed a challenge, but is having multiple modeling sources itself the problem? I see these as two distinct issues, and the latter doesn't seem avoidable; it will happen somewhere regardless.
You mean decoder-only transformers, right? What about encoder-only transformers (like BERT) and encoder-decoder transformers (like T5)? What's the plan for non-transformer models, such as diffusion models or timm models? If we're heading down this path, I think we need to consider the full picture.

Q: Should the ExecuTorch repo serve as a recipe repository? If so, how many recipes do you expect to host in the ExecuTorch repo? This proposal seems to imply that the ExecuTorch repo will also function as a recipe repository. I agree that providing a default recipe for each backend makes sense. However, that alone doesn't justify the need to host these recipes within ExecuTorch. Some of the proposed ideas, such as controlling recipes via a configuration file, are already well supported by Hugging Face, not just for eager mode but also for ONNX and TFLite. Why is it necessary to rebuild a similar mechanism and maintain it in our repo?

From the perspective of building a vibrant community, I think it is key that recipes are separated from the core. We can offer a default recipe for each backend as an option, but users shouldn't be limited to copying and customizing ours for their own needs. To encourage organic community growth, users should be able to create as many recipes as they want and make them shareable so that other OSS users can benefit. This level of openness wouldn't be possible if recipes were tightly coupled to our repo.
-
Thanks for the feedback! I'm planning to revamp the RFC to highlight:
-
Okay, updated the discussion to V2. Thanks again for your comments. Please take a look and let me know if you have further comments! cc @GregoryComer @kimishpatel @cccclai @digantdesai @guangy10 @larryliu0820 @jackzhxng
-
### tl;dr
The goal of this RFC is to streamline the end-to-end on-device LLM deployment flow via ExecuTorch. Potential users:
- An LLM developer or hobbyist: I want to quickly put an LLM on a device and tune its accuracy/performance.
- An app developer: I want to efficiently integrate an LLM into my Android or iOS app.

Some success metrics for this project:
- Time to deploy a new LLM is competitive with other LLM deployment frameworks.
- Users find the ExecuTorch stack/tools useful, along with the ability to run on different HW backends on Android and iOS.
- Usage and adoption of ExecuTorch increase:
  - LLM developers and hobbyists use ExecuTorch in their day-to-day development.
  - App developers integrate ExecuTorch into their production apps.

Non-goal: this project is not about deploying arbitrary non-LLM models via ExecuTorch. However, it can be a critical part of "ExecuTorch just works".
### Context
Popular LLMs share similar transformer-based architectures. The fixed architecture brings some convenience to deployment; llama.cpp is one example. However, when deploying to a variety of backends, the flows can differ due to backend-specific limitations, including:
- Static shapes vs. dynamic shapes (see the export sketch after this list)
- Static quantization vs. dynamic quantization
- Data types a backend supports
- Kernels available in a backend
- Different types of attention layers
- How the KV cache is handled
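As a concrete illustration of the static-vs-dynamic-shape difference, the sketch below exports a toy module twice with `torch.export`: once with a fixed sequence length and once with a dynamic one. Only public `torch.export` APIs are used; the module and dimension names are made up for illustration.

```python
import torch
from torch.export import export, Dim


class ToyBlock(torch.nn.Module):
    """Tiny stand-in for an LLM block; real models are far larger."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.proj = torch.nn.Linear(hidden, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


m = ToyBlock()
example = (torch.randn(1, 16, 64),)  # (batch, seq_len, hidden)

# Static-shape export: the sequence length is burned into the graph
# (what backends like QNN and ANE typically require).
static_program = export(m, example)

# Dynamic-shape export: the sequence length can vary at runtime
# (what backends like XNNPACK can handle).
seq = Dim("seq_len", min=1, max=2048)
dynamic_program = export(m, example, dynamic_shapes={"x": {1: seq}})

print(static_program.graph_signature)
print(dynamic_program.range_constraints)
```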
#### Exploded code copies
Sometimes, updating the export recipe is not sufficient or efficient. Supporting a specific backend may require a separate copy of the model definition and a separate version of the runtime code. At the same time, there is a clear trend toward scale as more use cases and models need to be supported. Adding up those new models and multiplying them by the number of backends and use cases to support, the number of versions can explode.
Below is a table summarizing the existing code versions, their unique properties, and their use cases.
#### Slow deployment to devices
For each model, the journey from source code to on-device deployment is slow. Users may find it difficult to understand the details of each step: export, dynamic shapes, quantization schemes, partitioners in delegates, custom ops, runtime builds, runners on Android/iOS, the profiling process, etc.
#### Redundant work when enabling new models
As the number of new models scales, there can be redundant work in the ExecuTorch deployment flow for the same or similar architectures.
### RFC
#### Types of new LLM models
There are three types of "new" LLM models from the inference point of view:
1. Only the weights and configuration change. For example, from Llama 3.1 to Llama 3.2, or to other models like Qwen, Phi-3, etc.
2. There is a significant difference in some components. For example, the DeepSeek models use Mixture of Experts (MoE) in the FFN, and Multi-head Latent Attention (MLA) is used instead of Multi-Head Attention (MHA). However, the popular variants are limited in number.
3. A new model is composed of one or more transformer models plus other parts. For example, the vision encoder CLIP is composed of transformers plus additional embeddings and layer norms.
We are exploring efficient solutions for those types. The high-level design thoughts are:
- Reuse existing flows as much as possible.
- Hide the implementation details, but expose the necessary configurations to users.
#### Entry point of `export_llm`
The interface can be as simple as:
```bash
python -m export_llm --check_point="ckp.pt" --model_config="model_config.json" --export_config="export_config.yaml"

# Alternatively, directly downloading from Hugging Face
python -m export_llm --hf_id="meta-llama/Llama-3.2-1B" --export_config="export_config.yaml"
```
The `check_point` and `model_config` arguments are the same as in the existing `export_llama`. The differences are:
- It can be extended to other LLMs (beyond Llama).
- The export configs are aggregated into one file instead of a long list of arguments. The content of this YAML file should be user-facing: target backend, quantization scheme, etc. (see the sketch after this list).
- [To explore] The output artifacts should be both the .pte model file and, optionally, a runtime binary/app that can run directly on Android and iOS devices, plus cache artifacts.
- Hide all the CMake options related to the export options, like DEXECUTORCH_BUILD_XNNPACK, DEXECUTORCH_BUILD_KERNELS_CUSTOM, etc.
- The benchmark results can be obtained immediately with a flag in export_config.yaml.
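As a rough illustration, the hypothetical contents of such an `export_config.yaml` are sketched below as a Python dict. Every key here is made up for illustration; the RFC does not yet define a schema.

```python
# Hypothetical sketch of what export_config.yaml might contain, expressed as a
# Python dict. None of these keys are an existing ExecuTorch schema.
export_config = {
    "backend": "xnnpack",            # target backend: xnnpack, qnn, coreml, ...
    "quantization": {
        "mode": "8da4w",             # e.g., dynamic 8-bit activations, 4-bit weights
        "group_size": 128,
    },
    "max_seq_length": 2048,
    "use_kv_cache": True,
    "outputs": {
        "pte_file": "llama3_2_1b.pte",
        "build_runner": True,        # also produce an Android/iOS runner binary
    },
    "run_benchmark": True,           # surface benchmark results after export
}

# The CLI could load the real file with PyYAML, e.g.:
#   import yaml
#   with open("export_config.yaml") as f:
#       export_config = yaml.safe_load(f)
```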
Below are thoughts on implementation details that are not exposed directly to users.
#### Eager mode definition
Note: with improvements in export capabilities and backend support, ideally an arbitrary eager-mode definition could be exported and lowered to any backend. However, we don't see that being feasible in the near future.
To handle new models of types 1-3, there can be model definitions maintained by ExecuTorch:
- They provide a clean base for further export recipes.
- Variant implementations are hidden from users, for example the popular attention implementations. If necessary, we can also keep a copy of the definition for a class of backends, for example attention with static shapes for QNN and ANE.
- It's easy to compose a new model from the definition components (see the sketch after this list).
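A minimal sketch of what "hidden variant implementations" could look like: a registry of attention variants that a shared block builder picks from based on a config. The registry, class names, and config fields are hypothetical, not existing ExecuTorch APIs.

```python
import torch
from torch import nn

# Hypothetical registry of attention variants maintained by ExecuTorch.
ATTENTION_REGISTRY: dict[str, type[nn.Module]] = {}


def register_attention(name: str):
    def wrap(cls):
        ATTENTION_REGISTRY[name] = cls
        return cls
    return wrap


@register_attention("mha_static")  # e.g., static shapes for QNN / ANE
class StaticShapeMHA(nn.Module):
    def __init__(self, dim: int, n_heads: int, max_seq_len: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.max_seq_len = max_seq_len

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x, need_weights=False)
        return out


def build_block(config: dict) -> nn.Module:
    """Compose a block from registered components based on a (hypothetical) model config."""
    attn_cls = ATTENTION_REGISTRY[config["attention_type"]]
    return attn_cls(config["dim"], config["n_heads"], config["max_seq_len"])


block = build_block({"attention_type": "mha_static", "dim": 64, "n_heads": 4, "max_seq_len": 128})
print(block(torch.randn(1, 16, 64)).shape)  # torch.Size([1, 16, 64])
```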
Provide tools to help with source code transforms. Options are listed below:
- Weight mapping. For example, using torchtune utils to convert HF safetensors weights or the torchtune format to a PyTorch checkpoint. Example here. (A minimal sketch follows this list.)
- Config conversion: Qwen, DeepSeek, or easy ways to build those models using the existing components.
- Source-level transforms. Good for code unification, but not straightforward for readability, and they cannot be used in all situations (e.g., different APIs).
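A minimal sketch of the weight-mapping idea: renaming keys from a Hugging Face-style state dict into a target naming scheme before saving a PyTorch checkpoint. The key patterns below are illustrative only; real converters (e.g., the torchtune utils mentioned above) handle many more cases.

```python
import re
import torch

# Illustrative mapping from HF-style parameter names to a hypothetical
# ExecuTorch/torchtune-style naming scheme. Real mappings are model-specific.
KEY_PATTERNS = [
    (r"^model\.embed_tokens\.weight$", "tok_embeddings.weight"),
    (r"^model\.layers\.(\d+)\.self_attn\.q_proj\.weight$", r"layers.\1.attention.wq.weight"),
    (r"^model\.layers\.(\d+)\.self_attn\.k_proj\.weight$", r"layers.\1.attention.wk.weight"),
    (r"^model\.layers\.(\d+)\.mlp\.gate_proj\.weight$", r"layers.\1.feed_forward.w1.weight"),
    (r"^lm_head\.weight$", "output.weight"),
]


def convert_state_dict(hf_state_dict: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """Rename keys that match a known pattern; pass through everything else."""
    converted = {}
    for key, tensor in hf_state_dict.items():
        new_key = key
        for pattern, replacement in KEY_PATTERNS:
            if re.match(pattern, key):
                new_key = re.sub(pattern, replacement, key)
                break
        converted[new_key] = tensor
    return converted


# Usage sketch: load HF safetensors, convert, and save a plain PyTorch checkpoint.
# from safetensors.torch import load_file
# hf_sd = load_file("model.safetensors")
# torch.save(convert_state_dict(hf_sd), "ckp.pt")
```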
#### Open questions
How is QAT handled?
- Users may want to do QAT on torchtune models, since the infra is set up there.
- If it's eager-mode QAT (weight-only), we can do a transform to the QAT submodule.
- PT2E QAT has to happen on the ET definition. Should we set up the QAT flow based on the ET transformer? (A rough sketch of a PT2E QAT flow follows this list.)
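The PT2E path mentioned above could look roughly like the sketch below, using the generic torch.ao PT2E QAT APIs with the XNNPACK quantizer on a toy module. Module paths and the graph-capture API differ across PyTorch/ExecuTorch versions, so treat this as an illustrative sketch rather than the flow the RFC will adopt.

```python
import torch
from torch.ao.quantization.quantize_pt2e import prepare_qat_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)


class TinyFFN(torch.nn.Module):
    """Toy stand-in for an LLM feed-forward block."""

    def __init__(self):
        super().__init__()
        self.w1 = torch.nn.Linear(64, 128)
        self.w2 = torch.nn.Linear(128, 64)

    def forward(self, x):
        return self.w2(torch.nn.functional.silu(self.w1(x)))


model = TinyFFN()
example_inputs = (torch.randn(1, 16, 64),)

# Capture a trainable graph. The capture API has changed names across PyTorch
# versions (older releases use torch._export.capture_pre_autograd_graph).
training_gm = torch.export.export_for_training(model, example_inputs).module()

quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config(is_qat=True))

prepared = prepare_qat_pt2e(training_gm, quantizer)

# The QAT fine-tuning loop would run here, calling prepared(*batch) and
# back-propagating as usual; one forward pass stands in for it.
prepared(*example_inputs)

quantized = convert_pt2e(prepared)
# `quantized` can then go through to_edge / backend partitioning as usual.
```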
#### Export recipes
`export_llama` looks over-complicated, with all the command-line options to handle different quantization schemes, different backends, etc. Inspired by our internal ModAI tool, as well as torchtune's hydra-based configuration structure:
- [RFC] Have one configuration file/recipe for each backend.
- What format should host this recipe: a Python script or a YAML file? A config file has the advantage of simplicity (users don't have to know the implementation details) and better version control, but may take more effort to maintain.
- What's the granularity of the recipes? If all configs are decoupled from the implementation, it may be more reasonable to have one implementation, like export_llama_cpu, but multiple config YAMLs for each target use case, like different quantization group sizes (see the sketch after this list).
- Modularize the code for checkpoints, quantization, etc.
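One possible shape for the "single implementation, per-backend recipe" idea: a registry that maps a backend name from the recipe to a lowering function. The function names and recipe fields below are hypothetical sketches, not existing export_llama internals.

```python
from typing import Callable

import torch

# Hypothetical recipe registry: backend name -> lowering function.
RECIPES: dict[str, Callable] = {}


def register_recipe(backend: str):
    def wrap(fn):
        RECIPES[backend] = fn
        return fn
    return wrap


@register_recipe("xnnpack")
def lower_for_xnnpack(exported_program, recipe: dict):
    # A real recipe would call to_edge(...) and the XNNPACK partitioner,
    # with quantization settings taken from `recipe`.
    print(f"Lowering for XNNPACK, group_size={recipe.get('group_size')}")
    return exported_program


@register_recipe("qnn")
def lower_for_qnn(exported_program, recipe: dict):
    # A real recipe would apply static-shape constraints and the QNN partitioner.
    print("Lowering for QNN with static shapes")
    return exported_program


def export_llm(model: torch.nn.Module, example_inputs, recipe: dict):
    """Single implementation; the recipe (loaded from YAML) picks the backend path."""
    ep = torch.export.export(model, example_inputs)
    return RECIPES[recipe["backend"]](ep, recipe)


# Usage sketch with a toy model and an in-memory recipe.
toy = torch.nn.Linear(8, 8)
export_llm(toy, (torch.randn(1, 8),), {"backend": "xnnpack", "group_size": 128})
```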
#### Runtime
The runtime is another user-facing entry point. It's deployed to Android or iOS, with the capability to load and run LLM models. The LLM model artifacts (.pte files) can come from:
- downloads from hf.com/executorch-community
- optimum-executorch
- the executorch.export API
The runtime code may need more simplification and unification, due to the complexity of maintaining and building the C++ code.
- [RFC] Runtime code should be as backend-agnostic as possible. Some features should be modularized, and a library of those features should be provided.
- Runtime code should be simple. Complex logic should be put into the model if possible, for the reasons below:
  - Scalability: C++ code is reused through operators; there is no need to maintain multiple C++ files for multiple models.
  - Portability: no need to sync C++ files across two repos during development (e.g., ExecuTorch and torchchat).
  - Better UX: it's easier and less error-prone for users to integrate model inference into their use cases.
To accelerate development, Python bindings for the runtime APIs would be provided. Users can call runtime components wherever Python is available, e.g., on Macs for development purposes.
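For example, running an exported .pte from Python during development might look roughly like the snippet below. It assumes the `portable_lib` pybindings that ship with ExecuTorch; module paths and method names have changed between releases, so check the installed version.

```python
import torch

# Assumed import path for the ExecuTorch pybindings; it has moved between
# releases, so adjust for your installed version.
from executorch.extension.pybindings.portable_lib import _load_for_executorch

# Load a previously exported program (the path is just an example).
module = _load_for_executorch("llama3_2_1b.pte")

# Run one step; the expected inputs depend entirely on how the model was
# exported (e.g., token ids plus a cache position for a KV-cache model).
tokens = torch.tensor([[1, 2, 3]], dtype=torch.long)
outputs = module.forward([tokens])
print(outputs[0].shape)
```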
There is a strong need for runtime components. For example:
- KV cache management. We should modularize it and provide APIs to access KV caches (a hypothetical API sketch follows this list).
- When the user logic gets more complicated, a local data container to efficiently and safely store/retrieve data would be necessary. Our existing MLDW may help here.
- Tokenizer.
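A hypothetical sketch of what a Python-facing KV cache management API could look like. None of these class or method names exist in ExecuTorch today; this is purely to illustrate the kind of modular component being asked for.

```python
import torch


class KVCacheManager:
    """Hypothetical KV cache container keyed by layer index."""

    def __init__(self, n_layers: int, max_seq_len: int, n_heads: int, head_dim: int):
        shape = (1, n_heads, max_seq_len, head_dim)
        self.k = [torch.zeros(shape) for _ in range(n_layers)]
        self.v = [torch.zeros(shape) for _ in range(n_layers)]
        self.pos = 0

    def update(self, layer: int, k_new: torch.Tensor, v_new: torch.Tensor) -> None:
        """Write new key/value states at the current position."""
        seq = k_new.shape[2]
        self.k[layer][:, :, self.pos:self.pos + seq] = k_new
        self.v[layer][:, :, self.pos:self.pos + seq] = v_new

    def advance(self, n_tokens: int) -> None:
        self.pos += n_tokens

    def reset(self) -> None:
        """Start a fresh conversation without reallocating buffers."""
        self.pos = 0


cache = KVCacheManager(n_layers=2, max_seq_len=128, n_heads=4, head_dim=16)
cache.update(0, torch.randn(1, 4, 3, 16), torch.randn(1, 4, 3, 16))
cache.advance(3)
print(cache.pos)  # 3
```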
#### On-device deployment
The on-device deployment (to Android and iOS) should be fast:
- A testing binary or iOS app can be built with minimal user interaction. (Can we do it from the same export_llm entry point?)
- It should be easy for users to quickly integrate ET with their own app or SDK.
- Hansong's Android Roadmap 25H1, the developer experience session, would help with LLM on-device deployment.
#### Evaluation and Benchmark
Nothing new here, but there's still a gap to achieving this:
- The benchmark information should be easy to obtain with a configuration.
- The benchmark data should be easy to understand, e.g., showing the performance bottlenecks and the hierarchical structure of the model.
- Hansong's OSS Android Benchmarking (minibench and microbench) covers ET in general; it would be great to have an LLM-specific benchmark, similar to the internal one.