Currently, ExecuTorch supports a single file format, PTE. A PTE file contains everything required to execute a model: instructions, delegated blobs, and constant weights.
If two PTE files are based on a common model, there is currently no way for them to share weights or other data. A system that downloads both PTE files must duplicate that data on disk, and loading both files at the same time duplicates it again in RAM, even when disk space is available. For very large models this can mean gigabytes of duplicated data, which is often infeasible on edge systems with constrained disk space and RAM.
We want to provide a way for backends to separate weights into multiple files.
Note: this doc covers backend data separation. For backend weight sharing, see: [RFC] Enable Weight Sharing across a single Backend.
Goals
Provide a way for multiple PTE files to share data, both on disk and in RAM.
Newly added infrastructure and APIs should have minimal effect on the existing implementation and ExecuTorch flow.
Data separation is opt-in.
Do not complicate AoT and runtime APIs for users who do not use data separation.
Do not significantly regress load time for users of data separation.
Do not significantly increase ET runtime binary size.
Do not introduce C++ standard library dependencies to core ExecuTorch.
Non-goals
Runtime retargetability: this does not cover the case where a generic PTE file is created and the backend is chosen at runtime based on the available hardware.
Currently, a PTE file is generated with specific backend(s) in mind. For example, a PTE file may contain a program that is partially lowered to XNNPACK, which means the runtime environment must include XNNPACK to run that PTE.
Loaded external data is not necessarily cached, so each request to load shared data may allocate new memory. For now, backends are expected to manage this under the hood in order to realize the memory savings from shared data (see the sketch below).
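As a rough illustration of what "manage this under the hood" could look like (and not part of this proposal's API surface), a backend could keep a small cache keyed by blob name so that repeated requests return the same loaded buffer. The types below (SharedBlobCache, Blob, load_fn) are purely hypothetical; note that backend code, unlike core ExecuTorch, is free to use the C++ standard library.

```cpp
// Hypothetical sketch of a backend-managed cache for shared external data.
// None of these types are real ExecuTorch APIs; they only illustrate the idea
// that a backend can hand out the same loaded blob for repeated requests.
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <string>

struct Blob {
  std::unique_ptr<uint8_t[]> data;
  size_t size;
};

class SharedBlobCache {
 public:
  // Returns a previously loaded blob if present; otherwise loads it once via
  // the provided loader and caches it for subsequent callers.
  std::shared_ptr<Blob> get_or_load(
      const std::string& key,
      std::shared_ptr<Blob> (*load_fn)(const std::string&)) {
    auto it = cache_.find(key);
    if (it != cache_.end()) {
      return it->second;
    }
    std::shared_ptr<Blob> blob = load_fn(key);
    cache_.emplace(key, blob);
    return blob;
  }

 private:
  std::map<std::string, std::shared_ptr<Blob>> cache_;
};
```

A real backend would also need to consider eviction and thread safety, which are omitted here.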
Overview
Data separation is a proposed feature that allows parts of the PTE file to live in separate, sharable files. This unblocks data sharing between separate PTE files.
Example
Note: in the diagram, each box is a separate file and the arrows indicate dependencies, e.g. PTE1 requires data1 and shared_data to execute.
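Rendered as text, the dependency structure is roughly as follows (PTE2 and data2 are described below):

```
PTE1 --> { data1, shared_data }
PTE2 --> { data2, shared_data }
```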
PTE1 and PTE2 are separate models that share data. An example use case is LoRA: multiple LoRA programs may share the same foundation weights while being optimized for different tasks, e.g. assistant or summarization. Here, PTE1 and PTE2 contain separate LoRA programs, and ‘shared_data’ contains the foundation weights for both. For LLMs, foundation weights can be on the order of gigabytes; without sharing, PTE1 and PTE2 must each hold a copy, duplicating potentially gigabytes of data.
‘data1’ and ‘data2’ may contain LoRA adapter weights. Adapter weights are usually small, on the order of megabytes, though the size varies with the degree of fine-tuning. Keeping ‘data1’ and ‘data2’ in standalone files helps with deployment efficiency: adapter weights are likely on a faster deployment cadence than the foundation weights, and deploying a smaller file OTA is quicker and less prone to failure. If the PTE and LoRA weights are both small, it is also reasonable to keep them in a single file and update them together.
Design
We propose new ahead-of-time APIs that provide backends with all the graphs, across partitions and methods, to be lowered. This enables backends to identify the components shared across these graphs. Additionally, we provide a blob storage service that backends can use to serialize data shared across graphs. At runtime, backends retrieve the shared data for any further initialization. The design details are fleshed out in the Blob Storage Service RFC (#8187); see its sections ‘AoT: Preprocess’ and ‘Runtime: NamedDataMap’.
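To make the runtime half of this flow concrete, the sketch below shows how a backend might fetch a named blob through a NamedDataMap-style interface during initialization, using the same string key under which the AoT preprocess step stored the blob. The interface, method names, and the "foundation_weights" key are placeholders for illustration only; the actual interface is specified in the Blob Storage Service RFC (#8187).

```cpp
// Illustrative only: a NamedDataMap-style lookup as a backend might use it
// during init. Interface and method names are placeholders, not the real
// ExecuTorch API (see #8187 for the actual design).
#include <cstddef>
#include <cstdint>

// Placeholder for a loaded region of externally stored data.
struct BufferView {
  const uint8_t* data;
  size_t size;
};

// Placeholder interface: maps string keys to externally stored blobs
// (e.g. entries in shared_data / data1 / data2 above).
class NamedDataMapLike {
 public:
  virtual ~NamedDataMapLike() = default;
  // Returns the blob associated with `key`, loading it if necessary.
  virtual BufferView get_data(const char* key) = 0;
};

// Hypothetical backend init: fetch the shared foundation weights that the
// AoT preprocess step stored under a known key, then finish initialization.
void* example_backend_init(NamedDataMapLike& named_data) {
  // "foundation_weights" is an illustrative key, not a reserved name.
  BufferView foundation = named_data.get_data("foundation_weights");
  // ... hand foundation.data / foundation.size to the backend's own runtime
  // (kernel setup, weight unpacking, etc.).
  (void)foundation;
  return nullptr;  // A real backend would return its handle here.
}
```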