Problem
In ExecuTorch today, models with multiple methods (e.g. prefill and decode) are exported as separate graphs. When lowering to a specific backend, each graph is lowered in isolation, without awareness of the other graphs being lowered to the same backend. This becomes a problem when the separate graphs share components. In a Llama model with prefill and decode, the linear layers in each method share the same weights and biases. Because the prefill and decode graphs are lowered separately, the shared weights and biases are copied into and serialized with each backend payload. The result is model bloat from duplicated weights, which is a limiting factor when bringing models to production, especially on memory-constrained devices.
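For illustration, a minimal sketch of the current flow that produces the duplication. The module and `ExampleBackendPartitioner` are hypothetical placeholders; the export and lowering calls follow the standard ExecuTorch flow:

```python
import torch
from executorch.exir import to_edge


class SharedLinear(torch.nn.Module):
    """Stand-in for a llama layer whose weights are used by both prefill and decode."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(64, 64)

    def forward(self, x):
        return self.linear(x)


model = SharedLinear()

# Prefill sees a full prompt, decode sees one token at a time; each export
# produces an independent graph carrying its own copy of the constants.
prefill_ep = torch.export.export(model, (torch.randn(8, 64),))
decode_ep = torch.export.export(model, (torch.randn(1, 64),))

edge = to_edge({"prefill": prefill_ep, "decode": decode_ep})

# Lowering happens per graph, so the backend serializes linear.weight and
# linear.bias into both the prefill payload and the decode payload.
# ExampleBackendPartitioner is a placeholder for any real backend partitioner.
lowered = edge.to_backend(ExampleBackendPartitioner())
```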
Requirements
The user flow for lowering and executing models should not change
Backwards compatible
Opt-in by delegates (delegates that do not implement this new feature should see no change)
Changes to the delegate APIs introduced by weight sharing should be reusable for program-data separation
Core metrics such as model load time, inference latency, and binary size should not be significantly affected
No new C++ standard library dependencies are introduced into ExecuTorch by this change
Goals
Provide an AoT API for backends to identify the shared components across the graphs being lowered
All graphs to be lowered by a given backend (across partitions and methods) are accessible to the backend before the first payload is serialized
Backends can identify shared data across graphs and serialize it separately from the lowered payloads, reducing copying across payloads (a sketch of one approach follows this list)
Read-only shared data is accessible by the backend when initializing the lowered payloads
Shared data will be loaded on request by backends
Shared data is freeable by backends after its use
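As a concrete illustration of the "identify shared data across graphs" goal, a backend could hash every constant tensor it sees across the graphs handed to it and treat constants with more than one user as candidates for a single shared blob. This is a sketch of one possible approach, not a prescribed API; the input layout and helper names are hypothetical:

```python
import hashlib
from collections import defaultdict

import torch


def tensor_key(t: torch.Tensor) -> str:
    """Content hash used to detect identical constants across graphs."""
    data = t.detach().contiguous().cpu().numpy().tobytes()
    return hashlib.sha256(data).hexdigest()


def find_shared_constants(named_constants_per_graph):
    """Map each content hash to the (graph, constant) pairs that reference it.

    `named_constants_per_graph` is assumed to look like
    {graph_name: {constant_name: tensor}}, e.g. gathered from the exported
    programs handed to the backend's preprocess step.
    """
    users = defaultdict(list)
    for graph_name, constants in named_constants_per_graph.items():
        for const_name, tensor in constants.items():
            users[tensor_key(tensor)].append((graph_name, const_name))
    # Entries with more than one user are candidates for a single shared blob.
    return {key: refs for key, refs in users.items() if len(refs) > 1}
```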
Non-Goals
While we wish to reuse design components with program-data separation, this feature will not implement program-data separation, and using weight sharing does not imply using program-data separation
Loaded shared data is not cached by the ExecuTorch runtime, meaning each request to load shared data at runtime allocates new memory for the shared data.
This does not attempt to enable sharing of weights across different backends.
Design
We propose new ahead-of-time APIs that provide backends with all of the graphs, across partitions and methods, that will be lowered to them. This enables backends to identify the shared components across these graphs. Additionally, we provide a blob storage service that backends can use to serialize data shared across graphs separately from the lowered payloads. At runtime, backends can retrieve the shared data for any further initialization. The design details are fleshed out in the Blob Storage Service RFC (#8187); see the sections 'AoT: Preprocess' and 'Runtime: NamedDataMap'.
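A minimal AoT-side sketch of how a backend might use these APIs, assuming a multi-method preprocess entry point and the NamedDataStore from the blob storage proposal. The names and signatures here (`preprocess_multimethod`, `NamedDataStore`, `data_store_output`, `_compile`) are illustrative, not a finalized interface:

```python
from executorch.exir.backend.backend_details import BackendDetails, PreprocessResult


class ExampleBackend(BackendDetails):
    @classmethod
    def preprocess_multimethod(cls, edge_programs, compile_specs):
        # edge_programs: {method_name: [graphs partitioned to this backend]}.
        # Every graph destined for this backend is visible here, before the
        # first payload is serialized.
        data_store = NamedDataStore()  # blob storage service proposed in #8187
        results = {}
        for method_name, programs in edge_programs.items():
            results[method_name] = []
            for program in programs:
                # _compile is a placeholder for the backend's own compiler; it
                # returns the lowered payload plus the constants to be shared.
                payload, shared_constants = cls._compile(program)
                for name, blob in shared_constants.items():
                    # Identical blobs are stored once under a name, so weights
                    # shared by prefill and decode are serialized a single time.
                    data_store.add_named_data(name, blob)
                results[method_name].append(
                    PreprocessResult(
                        processed_bytes=payload,
                        data_store_output=data_store.get_named_data_store_output(),
                    )
                )
        return results
```

At runtime, the lowered payloads would look up these blobs by name through the NamedDataMap during initialization, as described in the sections referenced above.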
cc. @lucylq @cccclai