
v1.4.0


Released by @TroyGarden on 07 Dec 18:25

Breaking Change

New Features

Unified Benchmark

Benchmarking is essential for TorchRec, a library designed for building and scaling massive recommender systems. Because TorchRec handles enormous embedding tables and complex model architectures across distributed hardware, a unified benchmarking framework lets developers quantify the performance implications of different configurations, including sharding strategies, specialized kernels, and model-parallelism techniques. This systematic evaluation is crucial for identifying the most efficient training and inference setups, uncovering bottlenecks, and understanding the trade-offs between speed, memory usage, and model accuracy for specific recommendation tasks. A minimal standalone timing sketch follows.
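
Below is a rough illustration of the kind of measurement such a framework systematizes: timing the forward pass of an unsharded EmbeddingBagCollection. The table name, sizes, and iteration counts are illustrative, and the top-level `torchrec` imports are assumed; the actual benchmark suite also covers sharded modules, kernels, and pipelines.

```python
# Minimal timing sketch; names and sizes are illustrative only.
import time

import torch
from torchrec import EmbeddingBagCollection, EmbeddingBagConfig, KeyedJaggedTensor

ebc = EmbeddingBagCollection(
    tables=[
        EmbeddingBagConfig(
            name="t1",
            embedding_dim=64,
            num_embeddings=10_000,
            feature_names=["f1"],
        )
    ]
)

# One id per sample, batch size 4.
kjt = KeyedJaggedTensor.from_lengths_sync(
    keys=["f1"],
    values=torch.randint(0, 10_000, (4,)),
    lengths=torch.tensor([1, 1, 1, 1]),
)

for _ in range(10):  # warmup
    ebc(kjt)

start = time.perf_counter()
for _ in range(100):
    ebc(kjt)
print(f"avg forward: {(time.perf_counter() - start) / 100 * 1e3:.3f} ms")
```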

RecMetrics Offloading to CPU

  • Zero-Overhead RecMetric (ZORM)
    We developed a CPU-offloaded RecMetricModule implementation that removes metric update(), compute(), and publish() operations from the GPU execution critical path, achieving up to 11.47% QPS improvement in production models with numerical parity, at the cost of roughly 10% additional average host CPU utilization. A minimal sketch of the offloading idea follows. [#3123, #3424, #3428]
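
The sketch below illustrates only the general pattern, not TorchRec's actual ZORM implementation: metric state lives on CPU and update() work runs on a side thread, so the GPU stream never waits on metric math. `OffloadedMetric` is a hypothetical illustrative name.

```python
# Sketch of CPU-offloaded metric updates; not the TorchRec ZORM API.
import queue
import threading

import torch


class OffloadedMetric:
    """Wraps any object exposing update()/compute(); updates run off-thread."""

    def __init__(self, metric) -> None:
        self._metric = metric
        self._queue: queue.Queue = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self) -> None:
        while True:
            preds, labels = self._queue.get()
            self._metric.update(preds, labels)  # CPU work, off the critical path
            self._queue.task_done()

    def update(self, preds: torch.Tensor, labels: torch.Tensor) -> None:
        # Plain synchronous device-to-host copies for simplicity; a production
        # version would use pinned buffers and non-blocking copies to avoid stalls.
        self._queue.put((preds.detach().cpu(), labels.detach().cpu()))

    def compute(self):
        self._queue.join()  # drain pending updates before reading the result
        return self._metric.compute()
```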

Resharding API

The TorchRec Resharding API adds the ability to reshard embedding tables during training: given a newer sharding plan, it reshards the existing sharded embedding tables accordingly. This supports use cases such as manually tuning sharding plans mid-training, and it provides the underlying capability for Dynamic Resharding. The API accepts only the shards that change relative to the current plan; a hedged sketch of that delta-plan input follows the list below.

  • Enable changing the number of shards for CW resharding: #3188, #3245
  • Resharding API host memory offloading and BenchmarkReshardingHandler: #3291
  • Resharding API performance improvement: #3323
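
As a rough illustration of the delta-plan idea, the sketch below shows the shape of the input the API consumes. The type names, field names, and the `reshard_module` call are hypothetical placeholders, not the actual TorchRec signatures; see the PRs above for the real API.

```python
# Illustrative sketch only: the plan types and `reshard_module` entry point
# are hypothetical placeholders, not actual TorchRec signatures. The point
# is the input shape: a delta plan listing only the shards that change.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ShardPlacement:
    rank: int                 # rank that will own this shard after resharding
    shard_offsets: List[int]  # [row_offset, col_offset]
    shard_sizes: List[int]    # [rows, cols]


# Delta plan: only tables whose sharding changes are listed. Here a CW-sharded
# table goes from one shard on rank 0 to two column shards on ranks 0 and 1.
changed_shards: Dict[str, List[ShardPlacement]] = {
    "table_0": [
        ShardPlacement(rank=0, shard_offsets=[0, 0], shard_sizes=[10_000, 64]),
        ShardPlacement(rank=1, shard_offsets=[0, 64], shard_sizes=[10_000, 64]),
    ],
}

# Conceptually the API then moves only the affected shards:
# sharded_model = reshard_module(sharded_model, changed_shards)  # hypothetical
```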

Prototyping KVZCH (Key-Value Zero-Collision Hashing)

  • Extend the current TBE: Considerable effort and expertise have gone into a performance-optimized TBE for accessing HBM as well as host DRAM. We want to leverage those capabilities and build on top of TBE.
  • Abstract out the details of the backend memory: The backing memory could be SSD, remote memory tiers accessed through the backend, or remote memory accessed through the frontend. We want to enable all of these capabilities without adding backend-specific logic to the TBE code. A minimal interface sketch follows.
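
The sketch below assumes only the design goal described above: embedding rows live behind a key-value interface so TBE never contains backend-specific logic. `KVEmbeddingBackend` and `InMemoryBackend` are illustrative names, not the actual TorchRec/FBGEMM interface.

```python
# Hypothetical backend abstraction sketch; not the real TorchRec/FBGEMM API.
from abc import ABC, abstractmethod
from typing import Dict

import torch


class KVEmbeddingBackend(ABC):
    """Key-value store for embedding rows, keyed by (possibly hashed) ids."""

    @abstractmethod
    def get(self, ids: torch.Tensor) -> torch.Tensor:
        """Fetch rows for `ids`; unseen ids yield freshly initialized rows."""

    @abstractmethod
    def set(self, ids: torch.Tensor, rows: torch.Tensor) -> None:
        """Write updated rows back for `ids`."""


class InMemoryBackend(KVEmbeddingBackend):
    """DRAM-backed reference; SSD or remote-memory tiers would swap in here."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self.store: Dict[int, torch.Tensor] = {}

    def get(self, ids: torch.Tensor) -> torch.Tensor:
        return torch.stack(
            [self.store.get(int(i), torch.zeros(self.dim)) for i in ids]
        )

    def set(self, ids: torch.Tensor, rows: torch.Tensor) -> None:
        for i, row in zip(ids.tolist(), rows):
            self.store[i] = row.detach().clone()
```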

  • Add configs for write dist: #3390
  • Allow uneven row-wise sharding based on the number of buckets for ZCH: #3341
  • Fix embedding table type and eviction policy in st publish: #3309
  • Add direct_write_embedding method: #3332

Change Log

  • In rare VBE cases, a KJT whose features all share the same batch size is not recognized as VBE at KJT init, which can cause issues in the forward pass. We now initialize both output dist comms to support this: #3378
  • Pipeline minor changes, docstrings, and refactoring: #3294, #3314, #3326, #3377, #3379, #3384, #3443, #3345
  • Add ability in SSDTBE to fetch weights from L1 and SP from outside of the module: #3166
  • Add validations for rec metrics config creation to avoid out of bounds indices: #3421
  • Add variable batch size support to tower QPS: #3438
  • Add row based sharding support for FeaturedProcessedEBC: #3281
  • Add logging when merging VBE embeddings from multiple TBEs: #3304
  • Full change log

Compatibility

  • fbgemm-gpu==1.4.0
  • torch==2.9.0

Test Results

[test results image]