Problem
In ExecuTorch today, models with multiple methods (e.g. prefill and decode) are exported as separate graphs. When lowering to a specific backend, each graph is lowered in isolation, without awareness of the other graphs being lowered to the same backend. This becomes a problem when the separate graphs share components. In a Llama model with prefill and decode, the linear layers in each method share the same weights and biases. Because the prefill and decode graphs are lowered separately, the shared weights and biases are copied into and serialized with each backend payload. The result is model bloat from duplicated weights, which is a limiting factor when bringing models to production, especially on memory-constrained devices.
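For illustration, a minimal sketch of the current flow that produces the duplication. The module and `ExampleBackendPartitioner` are hypothetical placeholders; the export and lowering calls follow the standard ExecuTorch flow:

```python
import torch
from executorch.exir import to_edge


class SharedLinear(torch.nn.Module):
    """Stand-in for a llama layer whose weights are used by both prefill and decode."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(64, 64)

    def forward(self, x):
        return self.linear(x)


model = SharedLinear()

# Prefill sees a full prompt, decode sees one token at a time; each export
# produces an independent graph carrying its own copy of the constants.
prefill_ep = torch.export.export(model, (torch.randn(8, 64),))
decode_ep = torch.export.export(model, (torch.randn(1, 64),))

edge = to_edge({"prefill": prefill_ep, "decode": decode_ep})

# Lowering happens per graph, so the backend serializes linear.weight and
# linear.bias into both the prefill payload and the decode payload.
# ExampleBackendPartitioner is a placeholder for any real backend partitioner.
lowered = edge.to_backend(ExampleBackendPartitioner())
```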
Requirements
The user flow for lowering and executing models should not change
Backwards compatible
Opt-in by delegates (delegates that do not implement this new feature should see no change)
Changes to the delegate APIs introduced by weight sharing should be reusable for program-data separation
Core metrics such as model load time, inference latency, and binary size should not be significantly affected
No new C++ standard library dependencies are introduced into ExecuTorch by this change
Goals
Provide an AoT API for backends to identify the shared components across the graphs being lowered
All graphs to be lowered by a given backend (across partitions and methods) are accessible to the backend before the first payload is serialized
Backends can identify shared data across graphs and serialize it separately from the lowered payloads, reducing copying across payloads (a sketch of one approach follows this list)
Read-only shared data is accessible by the backend when initializing the lowered payloads
Shared data will be loaded on request by backends
Shared data is freeable by backends after its use
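As a concrete illustration of the "identify shared data across graphs" goal, a backend could hash every constant tensor it sees across the graphs handed to it and treat constants with more than one user as candidates for a single shared blob. This is a sketch of one possible approach, not a prescribed API; the input layout and helper names are hypothetical:

```python
import hashlib
from collections import defaultdict

import torch


def tensor_key(t: torch.Tensor) -> str:
    """Content hash used to detect identical constants across graphs."""
    data = t.detach().contiguous().cpu().numpy().tobytes()
    return hashlib.sha256(data).hexdigest()


def find_shared_constants(named_constants_per_graph):
    """Map each content hash to the (graph, constant) pairs that reference it.

    `named_constants_per_graph` is assumed to look like
    {graph_name: {constant_name: tensor}}, e.g. gathered from the exported
    programs handed to the backend's preprocess step.
    """
    users = defaultdict(list)
    for graph_name, constants in named_constants_per_graph.items():
        for const_name, tensor in constants.items():
            users[tensor_key(tensor)].append((graph_name, const_name))
    # Entries with more than one user are candidates for a single shared blob.
    return {key: refs for key, refs in users.items() if len(refs) > 1}
```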
Non-Goals
While we wish to reuse design components with program-data separation, this feature will not implement program-data separation, and using weight sharing does not imply using program-data separation
Loaded shared data is not cached by the ExecuTorch runtime, meaning each request to load shared data at runtime allocates new memory for the shared data.
This does not attempt to enable sharing of weights across different backends.
Design
We propose new ahead-of-time APIs that provide backends with all of the graphs, across partitions and methods, that will be lowered to them. This enables backends to identify the shared components across these graphs. Additionally, we provide a blob storage service that backends can use to serialize data shared across graphs separately from the lowered payloads. At runtime, backends can retrieve the shared data for any further initialization. The design details are fleshed out in the Blob Storage Service RFC (#8187); see the sections 'AoT: Preprocess' and 'Runtime: NamedDataMap'.
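A minimal AoT-side sketch of how a backend might use these APIs, assuming a multi-method preprocess entry point and the NamedDataStore from the blob storage proposal. The names and signatures here (`preprocess_multimethod`, `NamedDataStore`, `data_store_output`, `_compile`) are illustrative, not a finalized interface:

```python
from executorch.exir.backend.backend_details import BackendDetails, PreprocessResult


class ExampleBackend(BackendDetails):
    @classmethod
    def preprocess_multimethod(cls, edge_programs, compile_specs):
        # edge_programs: {method_name: [graphs partitioned to this backend]}.
        # Every graph destined for this backend is visible here, before the
        # first payload is serialized.
        data_store = NamedDataStore()  # blob storage service proposed in #8187
        results = {}
        for method_name, programs in edge_programs.items():
            results[method_name] = []
            for program in programs:
                # _compile is a placeholder for the backend's own compiler; it
                # returns the lowered payload plus the constants to be shared.
                payload, shared_constants = cls._compile(program)
                for name, blob in shared_constants.items():
                    # Identical blobs are stored once under a name, so weights
                    # shared by prefill and decode are serialized a single time.
                    data_store.add_named_data(name, blob)
                results[method_name].append(
                    PreprocessResult(
                        processed_bytes=payload,
                        data_store_output=data_store.get_named_data_store_output(),
                    )
                )
        return results
```

At runtime, the lowered payloads would look up these blobs by name through the NamedDataMap during initialization, as described in the sections referenced above.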
cc. @lucylq @cccclai