
FBGEMM_GPU v0.6.0

Released by @spcyppt on 31 Jan 19:40

Release Notes

Highlights

  • Improvements and bug fixes for TBE variable batch size
  • Many TBE extensions and benchmarks
  • Enhanced TBE pipeline prefetching for UVM caching
  • Code refactoring and reorganization for faster builds
  • Many improvements and new sparse ops
  • Improved low-precision ops
  • Support for Python 3.12
  • PyTorch 2 support for various operators

Software Requirements

FBGEMM_GPU v0.6.0 has been tested and is known to work with the following setups:

  • PyTorch: v2.2
  • CUDA: v11.8, 12.1
  • Python: v3.8, 3.9, 3.10, 3.11, 3.12

It is recommended to install and run FBGEMM_GPU in an isolated environment, such as a Conda environment or a Docker container.
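
For example, a minimal Conda-based setup might look like the following (the environment name and the choice of Python 3.11 are illustrative, picked from the supported versions listed above):

# Create and activate an isolated Conda environment (environment name is illustrative)
conda create -n fbgemm_gpu_env python=3.11 -y
conda activate fbgemm_gpu_env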

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant (only the CUDA 12.1 variant is available on PyPI)
pip install fbgemm-gpu==0.6.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.6.0

Alternatively, it can be fetched from the PyTorch PIP index:

# FBGEMM_GPU CUDA variant (CUDA 11.8)
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cu118/

# FBGEMM_GPU CUDA variant (CUDA 12.1)
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cu121/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cpu
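
As a quick post-install sanity check, importing the package (which also registers the FBGEMM operators with PyTorch) should succeed; the one-liner below is an illustrative check, not an official installation step:

# Confirm the package imports cleanly and print where it was installed
python -c "import fbgemm_gpu; print(fbgemm_gpu.__file__)"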

Changes

Table Batched Embedding (TBE) Operators

  • [Improvement] Extended support and bug fixes for variable batch size (#2012, #2043, #2107, #2150, #2188)
  • [Improvement] Enhanced caching and cache lookup for pipeline prefetching (#2147, #2154, #2151)
  • [New] Support MTIA device type in FBGEMM TBE training (#1994)
  • [New] Enable sequence TBE on CPU via AVX (#2195)
  • [New] Enable subwarp only for unweighted (#2051)
  • [New] Add meta functions (#2094, #2102)
  • [New] Add reverse qparam option for MTIA (#2109)
  • [New] Add uvm_cache_stats for the direct-mapped cache (#1951, #1952)
  • [Improvement] Use memcpy for CPU embedding in-place update (#2166)
  • [Improvement] Remove indices and offsets copying from prefetch (#2186)
  • [Improvement] Improve perf for L=0 cases for TBE v2 (#2046)
  • [Improvement] General fixes and enhancements (#2030, #2009)

Jagged Tensor Operators

  • [Improvement] Fix incorrect SymInt signature on dense_to_jagged (#2039)
  • [Improvement] Fix handling of non-contiguous tensors in jagged_index_select (#2060, #2061)

Index Select Operators

  • [Improvement] Get total D from CPU buffer in batch_index_select_dim0 (#2079)

Low-Precision Operators

  • [New] Add BF16 in padded FP8 quantize ops (#2010)
  • [Improvement] Improve quantize_comm error message (#2018)
  • [Improvement] Fix illegal memory access and initialize empty values in the FP8 quantize kernel (#2131, #2176)

Pooled Embedding

  • [New] Add permute_duplicate_pooled_embeddings op for CPU (#1939)
  • [Improvement] Use PyTorch's p2p access enable function (#2000)
  • [New] Add support for duplicate in permutations for permute_pooled_embs_split (#1940)
  • [Improvement] Improve all_to_one error message (#2019)
  • [New] Add meta function for fbgemm::merge_pooled_embeddings operator (#2069)
  • [New] Add variable batch per feature support to EBC (table-wise/column-wise only) (#1986)

Misc

Benchmarks / Tests

  • [New] Benchmark block_bucketize_sparse_features uneven sharding (#2140, #2169)
  • [New] Add unit test for unique cache lookup (#2160)
  • [New] Add autogenerated opcheck tests (#2050, #2069, #2073, #2092, #2118, #2139, #2152, #2173, #2193)
  • [New] Add tests for fbgemm ops (#2136, #2082)
  • [Improvement] Modify the TBE testbench to use the FBGEMM generate_requests function to generate indices and offsets (#1882)
  • [Improvement] Remove FP64 from TBE CPU tests (#2049)
  • [Improvement] Add warmup_runs to TBE benchmarks and run at least 1 warmup iteration (#2163)
  • [Improvement] Add --pooling in TBE nbit_cpu benchmark (#2200)
  • [Improvement] Fill embedding tables with randomized scales and bias in split-TBE benchmarks (#2031)

Build / CI Improvements and Fixes