Releases: meta-pytorch/torchrec
v1.4.0
Breaking Change
- TorchRec's GitHub repository has moved from the pytorch organization to meta-pytorch. [#3403]
- As part of the meta/pytorch organization change, PyTorch moved under the Linux Foundation (post), while TorchRec remains within Meta. Consequently, the GitHub repository has moved to meta-pytorch/torchrec.
- The docs will be migrated under the new https://github.com/meta-pytorch organization, with redirects from the old https://docs.pytorch.org/ domain to the new location.
New Features
Unified Benchmark
Benchmarking is essential for TorchRec, a library for building and scaling massive recommender systems. Because TorchRec handles enormous embedding tables and complex model architectures across distributed hardware, a unified benchmarking framework lets developers quantify the performance implications of different configurations, including sharding strategies, specialized kernels, and model-parallelism techniques. This systematic evaluation is crucial for identifying the most efficient training and inference setups, uncovering bottlenecks, and understanding the trade-offs between speed, memory usage, and model accuracy for specific recommendation tasks.
- DLRM wrapper and DeepFM benchmarking [#3128, #3167, #3168, #3169, #3198]
- Configuration and parameterization improvements [#3430, #3448, #3450]
- Benchmark unification and refactoring [#3191, #3237, #3251, #3252, #3450, #3453]
- New feature support in benchmarks [#3206, #3221, #3224, #3225, #3231, #3227]
- Benchmarks for comm ops [#3435, #3436]
- Representative benchmark YAML configs, and memory snapshot support in benchmarks [#3429, #3432, #3433]
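As a point of reference only, here is a minimal, self-contained timing sketch of the kind of measurement the unified framework systematizes (forward latency of a small, unsharded EBC on a fixed input). It is not the framework's own API; the table size, batch shape, and iteration count are arbitrary choices for illustration.

```python
import time

import torch
from torchrec import EmbeddingBagCollection, EmbeddingBagConfig, KeyedJaggedTensor

# A single 10k x 64 table fed by one feature, kept on CPU so the sketch runs anywhere.
ebc = EmbeddingBagCollection(
    tables=[
        EmbeddingBagConfig(
            name="t1", embedding_dim=64, num_embeddings=10_000, feature_names=["f1"]
        )
    ],
    device=torch.device("cpu"),
)

# One batch of 4 samples for feature "f1" with per-sample lengths [2, 0, 1, 3].
kjt = KeyedJaggedTensor(
    keys=["f1"],
    values=torch.randint(0, 10_000, (6,)),
    lengths=torch.tensor([2, 0, 1, 3]),
)

for _ in range(5):  # warm-up
    ebc(kjt)

iters = 100
start = time.perf_counter()
for _ in range(iters):
    ebc(kjt)
print(f"avg forward latency: {(time.perf_counter() - start) / iters * 1e3:.3f} ms")
```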
RecMetrics Offloading to CPU
- Zero-Overhead RecMetric (ZORM)
We have developed a CPU-offloaded RecMetricModule implementation that removes the metric update(), compute(), and publish() operations from the GPU execution critical path, achieving up to 11.47% QPS improvements in production models with numerical parity, at the cost of roughly 10% additional average host CPU utilization (see the sketch below). [#3123, #3424, #3428]
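To make the offloading idea concrete, below is a toy, hedged sketch of the pattern: update() only enqueues host copies, and the actual metric math runs on a worker thread, off the training loop's critical path. The class and its behavior are illustrative only, not the actual RecMetricModule/ZORM implementation.

```python
import queue
import threading

import torch


class CpuOffloadedAccuracy:
    """Toy metric whose update() only enqueues host copies; the reduction runs
    on a worker thread instead of the GPU execution critical path."""

    def __init__(self) -> None:
        self._q: queue.Queue = queue.Queue()
        self._correct = 0
        self._total = 0
        threading.Thread(target=self._worker, daemon=True).start()

    def update(self, preds: torch.Tensor, labels: torch.Tensor) -> None:
        # Detach and copy to host; a production version would use pinned
        # buffers and stream events to make this copy truly asynchronous.
        self._q.put((preds.detach().cpu(), labels.detach().cpu()))

    def _worker(self) -> None:
        while True:
            preds, labels = self._q.get()
            self._correct += int(((preds > 0.5) == labels.bool()).sum())
            self._total += labels.numel()
            self._q.task_done()

    def compute(self) -> float:
        self._q.join()  # drain pending updates before reading the state
        return self._correct / max(self._total, 1)


metric = CpuOffloadedAccuracy()
metric.update(torch.tensor([0.9, 0.2, 0.7]), torch.tensor([1.0, 0.0, 0.0]))
print(metric.compute())  # ~0.667
```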
Resharding API
The TorchRec Resharding API adds the capability to reshard embedding tables during training. It supports use cases such as manually tuning sharding plans mid-training and provides the resharding primitive used by Dynamic Resharding. Given a newer sharding plan, it reshards the existing sharded embedding tables, accepting only the shards that changed relative to the current plan.
- Enable changing the number of shards for CW resharding: #3188, #3245
- Resharding API host memory offloading and BenchmarkReshardingHandler: #3291
- Resharding API performance improvements: #3323
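As a hedged illustration of the inputs involved (not the Resharding API entry point itself, which lives in the PRs above): the sketch below builds a "current" and a "new" column-wise sharding plan with the standard planner; the Resharding API then accepts only the shards that differ between the two. The table size and world sizes are arbitrary.

```python
import torch
from torchrec import EmbeddingBagCollection, EmbeddingBagConfig
from torchrec.distributed.embeddingbag import EmbeddingBagCollectionSharder
from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
from torchrec.distributed.planner.types import ParameterConstraints
from torchrec.distributed.types import ShardingType

# An unsharded EBC with one column-wise-shardable table.
ebc = EmbeddingBagCollection(
    tables=[
        EmbeddingBagConfig(
            name="large_table",
            embedding_dim=512,
            num_embeddings=100_000,
            feature_names=["f1"],
        )
    ]
)


def make_plan(world_size: int):
    # Pin the table to column-wise sharding so the two plans differ only in
    # how its columns are split across ranks.
    planner = EmbeddingShardingPlanner(
        topology=Topology(world_size=world_size, compute_device="cuda"),
        constraints={
            "large_table": ParameterConstraints(
                sharding_types=[ShardingType.COLUMN_WISE.value]
            )
        },
    )
    return planner.plan(ebc, [EmbeddingBagCollectionSharder()])


current_plan = make_plan(world_size=2)  # plan the job is currently running with
new_plan = make_plan(world_size=4)      # plan we want to move to mid-training
# The Resharding API consumes only the shards that changed between the current
# plan and the new one; see the PRs above for the concrete entry point.
print(new_plan)
```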
Prototyping KVZCH (Key-Value Zero-Collision Hashing)
Extend the current TBE: considerable effort and expertise have gone into performance-optimized TBE for accessing HBM as well as host DRAM. We want to leverage those capabilities and extend on top of TBE.
Abstract out the details of the backend memory: the backing store could be SSD, remote memory tiers accessed through the backend, or remote memory accessed through the frontend. We want to enable all of these capabilities without adding backend-specific logic to the TBE code.
- Add configs for write dist: #3390
- Allow uneven row-wise sharding based on the number of buckets for ZCH: #3341
- Fix embedding table type and eviction policy in st publish: #3309
- Add direct_write_embedding method: #3332
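The sketch below is a conceptual illustration of the "abstract out the backend memory" idea: a minimal key-value row store interface behind which SSD or remote-memory tiers could sit, so the lookup layer stays backend-agnostic. The class and method names are hypothetical, not TorchRec's actual KVZCH interfaces.

```python
import abc
from typing import Dict

import torch


class KVEmbeddingBackend(abc.ABC):
    """Rows addressed by arbitrary int64 keys; the lookup layer never sees
    whether the rows live on SSD, a remote tier, or plain host memory."""

    @abc.abstractmethod
    def get(self, keys: torch.Tensor) -> torch.Tensor: ...

    @abc.abstractmethod
    def set(self, keys: torch.Tensor, rows: torch.Tensor) -> None: ...


class InMemoryBackend(KVEmbeddingBackend):
    """Dict-backed stand-in for an SSD or remote-memory tier."""

    def __init__(self, dim: int) -> None:
        self._dim = dim
        self._store: Dict[int, torch.Tensor] = {}

    def get(self, keys: torch.Tensor) -> torch.Tensor:
        # Missing keys materialize as zero rows (a stand-in for cold-start init).
        return torch.stack(
            [self._store.setdefault(int(k), torch.zeros(self._dim)) for k in keys]
        )

    def set(self, keys: torch.Tensor, rows: torch.Tensor) -> None:
        for k, row in zip(keys.tolist(), rows):
            self._store[k] = row.clone()


backend = InMemoryBackend(dim=4)
ids = torch.tensor([10, 7_000_000_000, 42])  # sparse, non-contiguous key space
print(backend.get(ids).shape)  # torch.Size([3, 4])
```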
Change Log
- There are rare VBE cases where one of the KJTs has a uniform batch size, so it is not recognized as VBE at KJT init, which can cause issues in the forward pass. We now initialize both output dist comms to support this: #3378
- Minor pipeline changes, docstrings, and refactoring: #3294, #3314, #3326, #3377, #3379, #3384, #3443, #3345
- Add the ability in SSDTBE to fetch weights from L1 and SP from outside the module: #3166
- Add validations for rec metrics config creation to avoid out-of-bounds indices: #3421
- Add variable batch size support to tower QPS: #3438
- Add row-based sharding support for FeatureProcessedEBC: #3281
- Add logging when merging VBE embeddings from multiple TBEs: #3304
- Full changelog
Compatibility
- fbgemm-gpu==1.4.0
- torch==2.9.0
Test results

v1.4.0-rc1
Release cut for v1.4.0
In sync with fbgemm-gpu release v1.4.0
In sync with PyTorch 2.9
v1.3.0
New Features
New Flavors of Training Pipelines
- Fused SDD: a new pipeline optimization scheme that overlaps the optimizer step with the embedding lookup. Training QPS gains are observed for models with a heavy optimizer (e.g., Shampoo). [#2916, #2933]
- 2D sharding support: the common SDD train pipeline now supports the 2D sharding schema. [#2929]
- PostProc module support in the train pipeline. [#2939, #2978, #2982, #2999]
Delta Tracker and Delta Store
ModelDeltaTracker is a utility for tracking and retrieving unique IDs and their corresponding embeddings or states from embedding modules in models built with TorchRec. [#3056, #3060, #3064, ...]
It's particularly useful for:
- Identifying which embedding rows were accessed during model execution
- Retrieving the latest delta or unique rows for a model
- Computing top-k changed embeddings
- Supporting streaming updated embeddings between systems during online training
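Below is a toy sketch of the concept (the names and behavior are illustrative, not the actual ModelDeltaTracker API): record the ids each batch looked up, keyed by feature, so the unique/delta rows can be retrieved later.

```python
import torch
from torchrec import KeyedJaggedTensor


class ToyDeltaTracker:
    """Records which embedding rows each batch touched, keyed by feature name."""

    def __init__(self) -> None:
        self._ids_per_key = {}

    def record_lookup(self, kjt: KeyedJaggedTensor) -> None:
        # Remember every id looked up in this batch.
        for key, jt in kjt.to_dict().items():
            self._ids_per_key.setdefault(key, []).append(jt.values())

    def get_delta_ids(self, key: str) -> torch.Tensor:
        # Unique ids touched since the tracker was created (or last cleared).
        chunks = self._ids_per_key.get(key, [torch.empty(0, dtype=torch.long)])
        return torch.unique(torch.cat(chunks))


kjt = KeyedJaggedTensor(
    keys=["f1"],
    values=torch.tensor([3, 3, 7, 42]),
    lengths=torch.tensor([2, 2]),
)
tracker = ToyDeltaTracker()
tracker.record_lookup(kjt)
print(tracker.get_delta_ids("f1"))  # tensor([ 3,  7, 42])
```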
Resharding API
The TorchRec Resharding API adds the capability to reshard embedding tables during training. It supports use cases such as manually tuning sharding plans mid-training and provides the resharding primitive used by Dynamic Resharding. Given a newer sharding plan, it reshards the existing sharded embedding tables, accepting only the shards that changed relative to the current plan. [#2911, #2912, #2944, #3053, ...]
- Resharding API supports Table-Wise (TW) and Column-Wise (CW) resharding
- Optimizer support includes SGD and Adagrad (with Row-wise Adagrad for TW)
- Provides a highly performant API, tested on up to 128 GPUs across 16 nodes with NVIDIA A100 80GB GPUs, achieving an average resharding downtime of approximately 200 milliseconds for around 100GB of total data.
- Achieved ~0.1% average downtime per reshard relative to total training time for a ~100GB DLRM model.
Prototyping KVZCH (Key-Value Zero-Collision Hashing)
Extend the current TBE: considerable effort and expertise have gone into performance-optimized TBE for accessing HBM as well as host DRAM. We want to leverage those capabilities and extend on top of TBE.
Abstract out the details of the backend memory: the backing store could be SSD, remote memory tiers accessed through the backend, or remote memory accessed through the frontend. We want to enable all of these capabilities without adding backend-specific logic to the TBE code.
MPZCH (Multi-Probe Zero-Collision Hashing) [#3089]
- We are introducing a novel Multi-Probe Zero Collision Hash (MPZCH) solution based on multi-round linear probing to address the long-standing hash collision problem in sparse embedding lookup. The proposed solution is general, highly performant, scalable, and simple.
- A fast CUDA kernel maps input sparse features to indices/slots with minimal chance of collision under a given probe budget; eviction or fallback may happen when a collision occurs (a toy CPU illustration follows this list). Mapped indices and eviction information are returned for the downstream embedding lookup and optimizer state update. The process takes only a couple of milliseconds per batch during training. A CPU kernel was also introduced to provide good performance in the inference environment.
- A row-wise sharded ManagedCollisionModule (MCH) is added to the TorchRec library, enabling seamless integration with large-scale distributed model training in production. No extra limit is applied to model scaling, and the training throughput regression is little to none.
- The solution has been adopted and tested by various product models with multi-billion hash sizes across retrieval and ranking. Promising results were observed from both offline and online experiments.
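The toy CPU illustration below shows only the core remapping idea (not the production CUDA/CPU kernels): each raw id gets a starting slot, is probed linearly for a bounded number of rounds, and reports a collision if the budget is exhausted so the caller can evict or fall back. The slot count and probe budget are arbitrary.

```python
import torch

NUM_SLOTS = 8    # size of the remapped id space (the "hash size")
MAX_PROBES = 4   # probe budget per id
slot_to_id = torch.full((NUM_SLOTS,), -1, dtype=torch.long)  # -1 marks an empty slot


def remap(raw_id: int) -> int:
    """Map a raw (possibly huge) id to a slot via multi-round linear probing."""
    start = raw_id % NUM_SLOTS
    for probe in range(MAX_PROBES):
        slot = (start + probe) % NUM_SLOTS
        occupant = int(slot_to_id[slot])
        if occupant in (-1, raw_id):   # free slot, or this id already owns it
            slot_to_id[slot] = raw_id
            return slot
    return -1  # probe budget exhausted: caller evicts a slot or falls back


ids = [101, 202, 303, 101, 109]  # 101 repeats; 109 starts at 101's slot and probes past it
print([remap(i) for i in ids])   # [5, 2, 7, 5, 6]
```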
Change Log
- ITEP (in-training embedding pruning) changes [#2902, #2934, #2986, #3002]
- stride_per_key_per_rank related changes [#2959, #3112, #3120, #3124]
- Benchmark refactoring and unification [#3107, #3104, #3082, #3096, ...]
- Input KJT validator [#2963]
Compatibility
- fbgemm-gpu==1.3.0
- torch==2.8.0
v1.3.0-rc3
v1.3.0-rc2
Bump the TorchRec version, pin the torch version
v1.3.0-rc1
Align with the fbgemm release cut around 6/28
v1.2.0
New Features
TensorDict support for EBC and EC
An EBC/EC module can now take a TensorDict as data input as an alternative to KeyedJaggedTensor: #2581, #2596
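A hedged sketch of the idea, assuming the tensordict package is installed: jagged ids are carried in a TensorDict and converted explicitly into the KeyedJaggedTensor the EBC has always consumed. The exact TensorDict layout the module accepts directly is defined by the PRs above, so the field names used here (f1_values, f1_lengths) are illustrative only.

```python
import torch
from tensordict import TensorDict
from torchrec import EmbeddingBagCollection, EmbeddingBagConfig, KeyedJaggedTensor

ebc = EmbeddingBagCollection(
    tables=[
        EmbeddingBagConfig(
            name="t1", embedding_dim=8, num_embeddings=100, feature_names=["f1"]
        )
    ]
)

# Jagged ids for feature "f1": sample 0 -> [1, 2], sample 1 -> [3], carried in a
# TensorDict as flat values plus per-sample lengths.
td = TensorDict(
    {"f1_values": torch.tensor([1, 2, 3]), "f1_lengths": torch.tensor([2, 1])},
    batch_size=[],
)

# Explicit TensorDict -> KJT conversion shown for clarity; with #2581/#2596 the
# EBC can accept a TensorDict input directly (see those PRs for the expected layout).
kjt = KeyedJaggedTensor(
    keys=["f1"], values=td["f1_values"], lengths=td["f1_lengths"]
)
pooled = ebc(kjt)          # KeyedTensor of pooled embeddings
print(pooled["f1"].shape)  # torch.Size([2, 8])
```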
Customized Embedding Lookup Kernel Support
The NVIDIA dynamicemb package depends on an old TorchRec release (r0.7) plus a PR (#2533); TorchRec's embedding lookup structures were refactored to make it easy to plug in a customized embedding-lookup kernel: #2887, #2891
Prototype of Dynamic Sharding
Add the initial dynamic sharding API and tests. The current version supports EBC, TW, and ShardedTensor; other variants beyond those configurations (e.g., CW, RW, DTensor) are not yet covered: #2852, #2875, #2877, #2863
TorchRec 2D Parallel for EmbeddingCollection
Added support for EmbeddingCollection modules in 2D parallel, covering all sharding types supported for EC. #2737
Changelog
v1.2.0-rc3
Revert #2876 and update the binary validation script
v1.2.0-rc2
Version number change
v1.2.0-rc1
First release candidate of v1.2.0
