Thread raw embedding streamer to dram_kv_embedding_cache #5432

Closed

chouxi wants to merge 42 commits into pytorch:main from chouxi:export-D94431329

Conversation

@chouxi (Contributor) commented Feb 26, 2026

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2404

Thread RES (Raw Embedding Streaming) parameters through the DRAM KV
embedding cache constructor chain and pybind to enable streaming
for the embedding cache enrichment path.
Currently, the feature is gated by `enable_raw_embedding_streaming`.

Key changes:

  • Thread 6 RES params (DramKVEmbeddingCache -> wrapper -> pybind -> Python)
  • Make raw_embedding_streamer_ protected for subclass access

Differential Revision: D94431329

spcyppt and others added 30 commits February 6, 2026 17:54
Summary:
Pull Request resolved: pytorch#5381

X-link: https://github.com/facebookresearch/FBGEMM/pull/2360

Add unit test that takes TBE configurations loaded
from JSON files. This enables testing with production TBE configurations extracted from training logs.

Changes:
- Add `extract_specs_from_log.py` script to extract TBE configurations from training logs into a structured JSON format
- Add helper functions to `test/tbe/common.py` for parsing config values:
  `parse_sparse_type`, `parse_pooling_mode`, `parse_embedding_location`,
  `parse_compute_device`, `parse_cache_algorithm`, `create_tbe_from_config`,
  `load_tbe_configs_from_file`
- Add `test_backward_adagrad_from_config_file` test that loads TBE configs
  from JSON and runs backward adagrad validation
- Add `input_tbe_specs.json` as a test resource with sample TBE configurations

Reviewed By: q10, gchalump

Differential Revision: D92390199

fbshipit-source-id: f25c9981fafdb1a6e2661bb5ce9b025d4da6995d
…#5376)

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2364

This PR fixes a bug in the floating-point specialized store function and adds a runtime check to skip the optimized kernel when `emb_t != grad_t`. Support for mixed precision is planned.

Pull Request resolved: pytorch#5376

Reviewed By: cthi

Differential Revision: D92419356

Pulled By: q10

fbshipit-source-id: 108ed99664464e3b21c477f555024fe159ef882e
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2363

Pull Request resolved: pytorch#5387

- Update OSS build script to include test packages setup

Reviewed By: henrylhtsang

Differential Revision: D92550346

fbshipit-source-id: 8ed1928505aba212b433eddf35da993f2d136b9c
Summary:
Pull Request resolved: pytorch#5385

This PR uses std::tuple/std::pair for shorter code, and uses structured bindings in loops and emplace_back for clearer code and better performance.

Pull Request resolved: pytorch#5384

Reviewed By: henrylhtsang

Differential Revision: D92545407

Pulled By: q10

fbshipit-source-id: c08bfd7c0b9738aab31647edacbfb269de1c5af8
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2361

Pull Request resolved: pytorch#5382

- Migrate FBGEMM assert to FBGEMM_CHECK, pt 1

Reviewed By: sryap, cthi

Differential Revision: D92443156

fbshipit-source-id: 8d8701d7f785e9d2f653eaf79c60fe600d392555
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2369

X-link: meta-pytorch/MSLK#140

- Update OSS build script to include test packages setup

Reviewed By: cthi

Differential Revision: D92733208

fbshipit-source-id: fbe1fb042a94da08f73eff1c7c37a776d7675b37
Summary:
Pull Request resolved: pytorch#5394

This PR moves "-Wno-unused-command-line-argument" to clang to avoid build failures.

Pull Request resolved: pytorch#5390

Reviewed By: cthi

Differential Revision: D92645775

Pulled By: q10

fbshipit-source-id: cda310a0794cf1f76fc462cc0b9263be341ad721
Summary:
Pull Request resolved: pytorch#5395

X-link: https://github.com/facebookresearch/FBGEMM/pull/2370

This PR applies more modernization fixes for C++ 17/20.

Pull Request resolved: pytorch#5391

Reviewed By: cthi

Differential Revision: D92645749

Pulled By: q10

fbshipit-source-id: 5914332bd6ea50c70c7b5d2cdd1f5ed0abd86bd1
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2372

Pull Request resolved: pytorch#5397

Four fixes to the backward_adagrad test:

1. **Fix bs_features construction**: The previous logic for building `bs_features` from `feature_table_map` was incorrect when tables were replicated across multiple features. It used manual insertion logic that was fragile and produced wrong mappings. Replaced with a direct list comprehension `[bs[t] for t in feature_table_map]` which correctly maps each feature to its corresponding table's embedding module.

2. **Fix OOM in gradcheck**: When `tbe_op` is provided, the test was proceeding into torch gradcheck which allocates significant additional GPU memory, causing OOM. Added an early return when `tbe_op` is set, and explicitly freed large tensors (`bs`, `bs_features`, `fs`, `gos`, `cc`, etc.) before gradcheck to reduce GPU memory pressure.

3. **Add TBE_CONFIG_INDEX env var**: Added support for a `TBE_CONFIG_INDEX` environment variable to select a single TBE config to test, allowing targeted debugging of individual configurations without running all configs.

4. **Fix T_ calculation**: Changed `T_ = len(xws)` to `T_ = num_features` for the `feature_requires_grad` tensor size, since its length should match the number of features, not the number of input tensors.
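Fix (1) above is small enough to sketch; the module names and values below are illustrative stand-ins, not the test's actual data:

```python
# Sketch of the bs_features fix: a direct list comprehension maps each
# feature to its table's embedding module, even when a table is replicated.
bs = ["emb_table_0", "emb_table_1"]   # one embedding module per table (stand-ins)
feature_table_map = [0, 0, 1]         # table 0 replicated across two features

bs_features = [bs[t] for t in feature_table_map]
print(bs_features)  # ['emb_table_0', 'emb_table_0', 'emb_table_1']
```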

Reviewed By: q10

Differential Revision: D92668399

fbshipit-source-id: 0956adf01c2291fcba7b05fe39b70f669a408156
Summary:
Pull Request resolved: pytorch#5398

X-link: https://github.com/facebookresearch/FBGEMM/pull/2374

Original commit changeset: 5914332bd6ea

Original Phabricator Diff: D92645749

Reviewed By: tonykao8080

Differential Revision: D92856605

fbshipit-source-id: 4c1c6ab1e30138e2749764c747a6be1c2e5b5028
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2368

Pull Request resolved: pytorch#5392

X-link: meta-pytorch/torchrec#3753

D88706297

Reviewed By: zlzhao1104

Differential Revision: D92738272

fbshipit-source-id: 57b9b66c7b0be5536e9a5526cc44fac216b94a1e
Summary:
Pull Request resolved: pytorch#5399

X-link: https://github.com/facebookresearch/FBGEMM/pull/2375

The bucket assignment in get_bucket_id_from_ratio uses division: floor(ratio / stride) + 1. At boundary values, floating point division rounds down, placing blocks with ratio = threshold_bucket × stride into threshold_bucket.

The eviction check in evict_block uses strict less-than: overall_ratio < threshold, where threshold = threshold_bucket × stride.

Since ratio equals threshold (same floating point value), the comparison evaluates to false.

These blocks are counted by the dry run (included in acc_count) but not evicted (fail the < check).
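A minimal sketch of the boundary mismatch (variable names assumed from the description above):

```python
# The boundary case: a block whose ratio is exactly threshold_bucket * stride.
stride = 0.1
threshold_bucket = 3
threshold = threshold_bucket * stride   # same floating-point value as the ratio
ratio = threshold_bucket * stride

# The dry run counts this block toward acc_count, but the eviction check
# uses strict less-than, which is False when ratio == threshold:
print(ratio < threshold)    # False -> counted by the dry run but never evicted
print(ratio <= threshold)   # True  -> an inclusive check would evict it
```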

Reviewed By: steven1327

Differential Revision: D92900276

fbshipit-source-id: 925ec04727db751990414c0c3d47a37536ca8210
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2379

Pull Request resolved: pytorch#5403

- Upgrade glibc to 2.28

Reviewed By: cthi

Differential Revision: D93138551

fbshipit-source-id: c80a14e739d1dfb138b7976edcc00e17f671b835
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2376

Pull Request resolved: pytorch#5400

- Scripts for building PyTorch

Reviewed By: cthi

Differential Revision: D92864314

fbshipit-source-id: 98db550101cdc396902079534a8c33372f141769
Summary:
Pull Request resolved: pytorch#5404

X-link: https://github.com/facebookresearch/FBGEMM/pull/2380

- Fix ROCm benchmark workflow

Reviewed By: henrylhtsang

Differential Revision: D93157777

fbshipit-source-id: c243cfb97c3f8e8e96e86b23a2f48dadf73d45e8
Summary:
Pull Request resolved: pytorch#5406

X-link: https://github.com/facebookresearch/FBGEMM/pull/2381

This PR implements an algorithmic optimization for the `group_index_select_or_add` kernel. The main focus is `USE_VAR_COLS=true` cases. We noticed that the `member_id` search takes about half of the kernel latency in some cases. There are 2 assumptions:

1. Since warp_id is in ascending order, we don't need to run the search on every iteration; it's more efficient to cache the member_id and upper bound.
2. The binary upper-bound search implementation (which seems to be a copy-paste of the CPU version) is inefficient on GPU (and most likely on CPU as well for small/medium-size problems).

With minor adjustments, these changes might be adapted for the CUDA path as well.

Pull Request resolved: pytorch#5365
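The cached-search idea in assumption 1 can be sketched in Python (names and shapes are assumptions; `offsets` stands in for the cumulative column-offset array the kernel searches):

```python
import bisect

def members_by_binary_search(warp_ids, offsets):
    # Baseline: an upper-bound binary search on every iteration.
    return [bisect.bisect_right(offsets, w) - 1 for w in warp_ids]

def members_cached(warp_ids, offsets):
    # Optimized: warp ids arrive in ascending order, so cache the current
    # member_id and its upper bound and only advance across group boundaries.
    out, member, upper = [], 0, offsets[1]
    for w in warp_ids:
        while w >= upper:
            member += 1
            upper = offsets[member + 1]
        out.append(member)
    return out

offsets = [0, 4, 7, 10]   # three groups of widths 4, 3, 3
warp_ids = list(range(10))
assert members_cached(warp_ids, offsets) == members_by_binary_search(warp_ids, offsets)
```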

| Instantiation                                                | Count | Mean (us) (Before) | Mean (us) (After) | % Improvement |
|--------------------------------------------------------------|-------|---------------------|---------------------|--------------|
| group_index_select_or_add_2d_kernel<long, float, false, t... | 1920  | 18080.67            | 17002.10            | 5.98%        |
| group_index_select_or_add_2d_kernel<long, float, false, f... | 960   | 9256.85             | 9750.73             | -5.35%       |
| group_index_select_or_add_2d_kernel<long, float, true, tr... | 1920  | 3987.01             | 2225.99             | 44.18%       |
| group_index_select_or_add_2d_kernel<long, float, true, fa... | 960   | 2618.33             | 2732.18             | -4.35%       |
| index_select_scalar_cumsum_kernel<long, int, long, 256, 1... | 6100  | 7.19                | 7.12                | 0.97%        |
| index_select_scalar_cumsum_kernel<int, int, int, 256, 102... | 1300  | 9.29                | 8.60                | 7.42%        |
| TOTAL                                                        | 13160 | 4090.13             | 3720.07             | 9.04%        |

https://docs.google.com/document/d/1dqJQhgsz_rVlvmrO0YsJ3ZpyahMW-cFoI6rDRWXjiwg/edit?tab=t.0#heading=h.5row1qfol66k

Test Plan:
```
cd ~/fbsource/fbcode/ai_codesign/nonprod/bensonma415/scripts/D92178105

# Run on MI300, MI350, A100, H100, GB200
bash run_benchmark.sh 2>&1 | pastry
```

```
# E2E Analysis

# Before Diff
buck run //ai_codesign/nonprod/bensonma415/scripts:download_and_analyze_traces -- \
  --url "https://www.internalfb.com/ai_infra/zoomer/mast_job/overview?jobName=aps-cmf_amd_echen40_20260212_154515-5fc47d5fee&jobVersion=0&jobAttempt=0" \
  -k index_select_scalar_cumsum_kernel \
  -k group_index_select_or_add_2d_kernel \
  --csv results.csv 2>&1 | pastry

> P2186151617: https://www.internalfb.com/intern/paste/P2186151617/

# After Diff
buck run //ai_codesign/nonprod/bensonma415/scripts:download_and_analyze_traces -- \
  --url "https://www.internalfb.com/ai_infra/zoomer/mast_job/overview?jobName=aps-cmf_amd_echen40_20260212_162821-e2bad24c93&jobVersion=0&jobAttempt=3" \
  -k index_select_scalar_cumsum_kernel \
  -k group_index_select_or_add_2d_kernel \
  --csv results.csv 2>&1 | pastry

> P2186174768: https://www.internalfb.com/intern/paste/P2186174768/
```

Reviewed By: spcyppt

Differential Revision: D92178105

Pulled By: q10

fbshipit-source-id: af99d2e1f63c40b8dd6963735d3e9f4136ad1554
…#5408)

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2384

Pull Request resolved: pytorch#5408

- Fix kernel launcher test to work with CUDA_LAUNCH_BLOCKING=1

Reviewed By: sryap

Differential Revision: D93292796

fbshipit-source-id: 726a991a3f9dd2af027ade4667edefd6bc8beb08
Summary:
Pull Request resolved: pytorch#5405

# Context
We are currently lacking CPU-GPU parity for `rowwise_adagrad_with_counter`, and this is evidenced by the `test_backward_optimizers_adagrad_with_counter_cpu` test failing if (1) it is enabled (currently skipped) and (2) if the new and ref weights are compared with tolerance 1.0e-4. This is because `rowwise_adagrad_with_counter` is missing various regularization and weight decay features.

# The Fix
In `rowwise_adagrad_with_counter`, I made the CPU weight update logic mirror that of the GPU. After running the `test_backward_optimizers_adagrad_with_counter_cpu` unit test, I have verified that it now passes, indicating correctness for the various tested configurations. Consequently, I was able to reduce the new vs ref weight tolerance to 1.0e-4, and verified that the rest of the tests in the file still pass.

Reviewed By: spcyppt

Differential Revision: D91080112

fbshipit-source-id: 6b5a0cf4a31c8b6e41f5cf7db5ea6206e634263d
Summary:
Pull Request resolved: pytorch#5407

X-link: https://github.com/facebookresearch/FBGEMM/pull/2383

**Change to CUDA_KERNEL_ASSERT**
=================================

CUDA_KERNEL_ASSERT2 returns when the runtime flag is not passed.

https://docs.google.com/document/d/1m0-5ana08jBZZfgp9lmbM3W1qYVf6TNOoRoXJoOVKuw/edit?tab=t.0#heading=h.e58hjdsk3ro2

**Summary**
-----------

This diff updates the `CUDA_KERNEL_ASSERT` statements in various files of the `fbgemm_gpu` library to provide more informative error messages. This change is made in five different files, each containing a separate update to the `CUDA_KERNEL_ASSERT` statement.

**Files Affected**
-------------------

* `sparse_pack_segments_forward.cu`
* `embedding_bounds_check_v1.cu`
* `gather_ranges_to_dense_v2.cu`
* `embedding_bounds_check_v2.cu`
* `sparse_index_select.cu`

Reviewed By: q10

Differential Revision: D93033489

fbshipit-source-id: 60287d12f44df4709cc64bab9f8d156062e54d30
Summary:
Pull Request resolved: pytorch#5189

X-link: https://github.com/facebookresearch/FBGEMM/pull/2187

populate_bucketized_permute is being used in request batching for seq embedding in RW

Currently, populate_bucketized_permute is parallelized by one thread per sequence, this is due to the fact that:
- This kernel is for seq embedding where the order of the ids matter
- We then need to calculate the offsets of the ids one by one serially

This diff improves the parallelism from thread to warp level using warp-level sync primitives (`__ballot_sync`/`__popc`). It assumes less than 32 shards (max warp size) which is predominantly true in serving.

There are alternatives to this approach, with pros and cons: we could do a prefix sum by world size to get the offset for each row and then parallelize per index. The pro is more parallelism; the con is more kernels (we need world size - 1 prefix sums). Given that, at least for recsys models, the bottleneck is on the GPU and this approach increases parallelism 32-fold, I leaned toward this approach.
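The warp-ballot offset computation can be emulated in plain Python (the lane layout and the function name below are illustrative, not the kernel's actual code):

```python
def lane_offset(lane_buckets, my_lane):
    """Rank of my_lane among lanes writing to the same bucket, emulating
    __ballot_sync (bitmask of matching lanes) + __popc (count lanes below me)."""
    my_bucket = lane_buckets[my_lane]
    ballot = 0
    for lane, bucket in enumerate(lane_buckets):  # every lane in the warp votes
        if bucket == my_bucket:
            ballot |= 1 << lane
    below_me = ballot & ((1 << my_lane) - 1)      # keep only lanes below mine
    return bin(below_me).count("1")

# 4 lanes whose ids go to buckets [0, 1, 0, 1]: each lane's slot in its bucket.
print([lane_offset([0, 1, 0, 1], lane) for lane in range(4)])  # [0, 0, 1, 1]
```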

# Rollback Plan
The new warp-parallel kernel is gated behind the `BUCKETIZED_PERMUTE_WARP_KERNEL` feature gate:
- **JustKnobs**: `fbgemm_gpu/features:BUCKETIZED_PERMUTE_WARP_KERNEL` — set to `false` to disable the warp-parallel kernel and fall back to the original per-thread kernel
- **Environment variable**: `FBGEMM_BUCKETIZED_PERMUTE_WARP_KERNEL=0` (for OSS/testing)
- When the feature gate is disabled, the code falls back to the original `_populate_bucketized_permute_cuda_kernel` with no behavior change
- When `my_size > 32` (kWarpSize), the original kernel is always used regardless of the gate

# Changes addressing reviewer feedback
- Used `CUDA_KERNEL_ASSERT` with meaningful error messages
- Added bounds check for `bucketized_offsets_data` access
- Simplified `my_bucket` initialization with ternary
- Inlined `pos` variable assignment
- Added comment explaining the second `__ballot_sync` call
- Added `BUCKETIZED_PERMUTE_WARP_KERNEL` feature gate (C++ enum + Python enum)
- Added `config_cpp` dependency to `sparse_ops_gpu` BUCK target
- Added comprehensive test `test_populate_bucketized_permute_warp_parallel_vs_original` comparing GPU warp-parallel kernel against CPU reference across 6 configurations
- Set `FBGEMM_NO_JK=1` and `FBGEMM_BUCKETIZED_PERMUTE_WARP_KERNEL=1` env vars in test BUCK target

Reviewed By: spcyppt, q10

Differential Revision: D88522379

fbshipit-source-id: 2bcc0e56a9293cf6b4ff575265a1fb4639bf18f0
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2387

Pull Request resolved: pytorch#5412

- Update PyTorch OSS build scripts

Reviewed By: henrylhtsang

Differential Revision: D93157176

fbshipit-source-id: 9ca81450af86a39e9e5e845eee4e0fb58ae788eb
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2389

Pull Request resolved: pytorch#5414

Reviewed By: henrylhtsang

Differential Revision: D93662599

Pulled By: q10

fbshipit-source-id: c1d7f5b8f343778bda774158340aa32cae9220c8
Summary:
Pull Request resolved: pytorch#5251

X-link: https://github.com/facebookresearch/FBGEMM/pull/2219

This change improves the performance of tracking the deltas in TBE, mainly by replacing a blocking DtoH copy {F1984231816} with a DtoD copy plus an async DtoH under a stream callback {F1984231839}.

To achieve this, the following is added:
- A pre-registered UVA buffer, accessible from both GPU and CPU, is reused every iteration
  - tying the tensors' lifetime to the TBE makes the async copy safe
  - reusing the same buffer avoids repeated allocation
- The CPU thread is triggered to async-copy in raw_embedding_streamer.stream()
  - GPU ops don't wait on the D2H
- To avoid the buffer being overwritten while the D2H copy is in flight:
  - a GPU event tracks completion of the D2D copy, and the CPU thread waits on it
  - join_stream_tensor_copy_thread performs a blocking wait for the copy in the next iteration, in case the CPU copy takes too long, before the pre-registered buffer is overwritten
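The event/join pattern can be sketched host-side with plain Python threading (a stand-in only; the real implementation uses a CUDA event and a pre-registered UVA buffer, and the names below mirror the description rather than the actual API):

```python
import threading

class DeltaStreamer:
    """Stand-in for the reuse-one-buffer + background-copy pattern."""
    def __init__(self, size):
        self.buffer = bytearray(size)   # stands in for the pre-registered UVA buffer
        self._thread = None

    def stream(self, d2d_done: threading.Event, sink: list):
        # Background "D2H": wait until the device-side D2D copy into the
        # buffer has finished, then drain it off the critical path.
        def work():
            d2d_done.wait()
            sink.append(bytes(self.buffer))
        self._thread = threading.Thread(target=work)
        self._thread.start()

    def join_stream_tensor_copy_thread(self):
        # Blocking wait before the buffer is overwritten next iteration.
        if self._thread is not None:
            self._thread.join()

streamer = DeltaStreamer(4)
done, out = threading.Event(), []
streamer.stream(done, out)
streamer.buffer[:] = b"abcd"   # the "D2D copy" into the shared buffer
done.set()                     # signal: device-side copy finished
streamer.join_stream_tensor_copy_thread()
print(out)  # [b'abcd']
```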

Reviewed By: q10

Differential Revision: D86888586

fbshipit-source-id: fe1c4bc7c2af28b201bafa6557617bef634b995c
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2390

Pull Request resolved: pytorch#5416

Reviewed By: cthi

Differential Revision: D93762155

Pulled By: q10

fbshipit-source-id: 4b8b7f025b646579e397fa1c5fd9aced99d94eb0
Summary:
Pull Request resolved: pytorch#5410

X-link: https://github.com/facebookresearch/FBGEMM/pull/2385

Stop using `constexpr` with `__builtin_constant_p`.

It turns out that using `constexpr` results in the expression being evaluated immediately by the frontend, before any inlining happens, so it always returns `false` in this context. However, this was meant to be evaluated AFTER the function got inlined: with a fixed `count` after specialization, the compiler can fully promote the local array to a vector register, which is only possible when it can analyze the code with the known `memset` semantics.

Reviewed By: Nicoshev

Differential Revision: D93544122

fbshipit-source-id: 5970e4225e09c1433f7ad989276937e4cf93164f
Summary:
Pull Request resolved: pytorch#5418

X-link: meta-pytorch/MSLK#158

X-link: https://github.com/facebookresearch/FBGEMM/pull/2392

When using CUTLASS grouped GEMM for wgrad, experts that receive zero tokens produce NaN weight gradients. The root cause is in bf16bf16bf16_grouped_wgrad.cu: when total_M > 0 (at least one expert has tokens), the output buffer was allocated with at::empty (uninitialized memory). The kernel's set_stacked_kernel_args_kernel skips experts where M_sizes[i] == 0, so those experts never write to their portion of the output, leaving NaN/garbage values. This is triggered consistently in RL training with 256 experts and small microbatches, where many experts receive zero tokens per microbatch.

The fix changes at::empty to at::zeros so that zero-token experts get zero gradients instead of uninitialized memory. The cost of zeroing vs not is negligible relative to the GEMM computation.

Also adds dedicated tests for wgrad with zero-token experts across FBGEMM, MSLK, and MSL test suites. The tests use a warmup-based cache poisoning technique — running the kernel once with all experts having tokens to fill the CUDA caching allocator with a non-zero G*N*K buffer, then freeing it so the allocator reuses that dirty block for the next at::empty call. This makes uninitialized-memory bugs deterministic and reproducible in tests, rather than relying on non-deterministic allocator behavior. Tests cover both total_M == 0 (all experts zero tokens) and total_M > 0 with partial zero-token experts.
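The empty-vs-zeros distinction can be reproduced with a numpy stand-in (shapes and the helper name are assumptions; the real kernel is the CUTLASS grouped GEMM):

```python
import numpy as np

def grouped_wgrad(M_sizes, grads, acts, N, K, alloc):
    # alloc is the at::empty / at::zeros analogue for the stacked output.
    out = alloc((len(M_sizes), N, K))
    for g, m in enumerate(M_sizes):
        if m == 0:
            continue                    # zero-token experts are skipped entirely
        out[g] = grads[g].T @ acts[g]   # (N, M) @ (M, K) -> (N, K)
    return out

M_sizes = [2, 0]                               # expert 1 receives zero tokens
grads = [np.ones((2, 3)), np.zeros((0, 3))]    # per-expert (M, N) output grads
acts = [np.ones((2, 4)), np.zeros((0, 4))]     # per-expert (M, K) activations

safe = grouped_wgrad(M_sizes, grads, acts, N=3, K=4, alloc=np.zeros)
assert (safe[1] == 0).all()   # skipped expert: zero gradients, not garbage
```

With `alloc=np.empty`, the skipped expert's slice would hold whatever bytes the allocator handed back, which is the uninitialized-memory bug described above.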

Reviewed By: cthi

Differential Revision: D93777683

fbshipit-source-id: b0445ca0dcff94663a6ed10d2c1949e017dbd760
Summary:
Pull Request resolved: pytorch#5417

X-link: https://github.com/facebookresearch/FBGEMM/pull/2391

- Fix undefined symbol error in OSS that was introduced with D88522379 (PR pytorch#5189)

Reviewed By: cthi

Differential Revision: D93773790

fbshipit-source-id: dca30fc82c34d36c4a6f73e21c5e666ff777d75f
…h#5420)

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2394

Pull Request resolved: pytorch#5420

- Disable building asmjit and fbgemm when building genai module

Reviewed By: jiawenliu64

Differential Revision: D93832742

fbshipit-source-id: e317137d2399c437a01830cc779ca65e07d616d6
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2395

- Update FBGEMM OSS build scripts

Reviewed By: cthi

Differential Revision: D93571865

fbshipit-source-id: 2db91699221195d0213e28e24fa6c971a409b0c1
Summary:
Pull Request resolved: pytorch#5424

X-link: https://github.com/facebookresearch/FBGEMM/pull/2397

There's a bug in TBE v2 kernel. Skip the test for now to unblock OSS.

Reviewed By: q10

Differential Revision: D93905146

fbshipit-source-id: afaa8b5b4d98e5c361a5f391dafc5ea5a2bdf5ee
chouxi and others added 5 commits February 20, 2026 20:18
…5423)

Summary:
Pull Request resolved: pytorch#5423

X-link: https://github.com/facebookresearch/FBGEMM/pull/2396

Original commit changeset: fe1c4bc7c2af

Original Phabricator Diff: D86888586

The culprit diff pulls CUDA deps into the raw_embedding_streamer target, which was exported via the split_table_batched_embeddings_cpu CPU target, used by quite a few CPU-only builds.
We should handle this properly; for now, back out the diff to mitigate.

Reviewed By: q10

Differential Revision: D93902471

fbshipit-source-id: 7e36e7d7f8f78ebd3d3e062ec5fe2861021fd57c
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2399

Pull Request resolved: pytorch#5425

- Skip tests if GPU memory is insufficient

Reviewed By: henrylhtsang

Differential Revision: D93970398

fbshipit-source-id: afea4a172817e614ea8c68f9528f43618f072312
Summary:
Pull Request resolved: pytorch#5426

X-link: https://github.com/facebookresearch/FBGEMM/pull/2398

- Trigger pytorch builds on exact SHA

Reviewed By: henrylhtsang

Differential Revision: D93541541

fbshipit-source-id: b1dae6396833c307d937724d623281fba2856f40
Summary:
Pull Request resolved: pytorch#5427

X-link: https://github.com/facebookresearch/FBGEMM/pull/2400

- Fix ROCm Nova builds

Reviewed By: cthi

Differential Revision: D94104910

fbshipit-source-id: 832063400cff5665cec900105a8be4fce8854a83
Summary:
Pull Request resolved: pytorch#5428

X-link: https://github.com/facebookresearch/FBGEMM/pull/2401

- Fix build warnings

Reviewed By: cthi

Differential Revision: D94113092

fbshipit-source-id: 936892e39b62b85c6da76dda67c187715c519baa
@meta-cla meta-cla bot added the cla signed label Feb 26, 2026
@meta-codesync

meta-codesync bot commented Feb 26, 2026

@chouxi has exported this pull request. If you are a Meta employee, you can view the originating Diff in D94431329.

gchalump and others added 7 commits February 25, 2026 17:16
Summary:
Pull Request resolved: pytorch#5431

X-link: https://github.com/facebookresearch/FBGEMM/pull/2403

The PyPI binary size limit was being exceeded in nightly builds. This change removes 12.0a from nightly to maintain compliance with the PyPI binary size limit.

Previously, release builds excluded 12.0a while nightly builds included it. This inconsistency caused nightly builds to exceed the size limit. Now both build types use the same architecture list.

Reviewed By: spcyppt

Differential Revision: D94406173

fbshipit-source-id: bbc1d382dd16494377d1b8e9586f6cb8d3af2195
Summary:
Pull Request resolved: pytorch#5435

X-link: https://github.com/facebookresearch/FBGEMM/pull/2406

as title

Reviewed By: sryap, q10

Differential Revision: D94480320

fbshipit-source-id: 00abd2d17fd818dfd575bd0fa16fcb737bc425e6
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2405

Pull Request resolved: pytorch#5433

Fix CI failure on Python 3.14 caused by MarkupSafe 2.1.5 circular import error.

The jinja2 dependency was pulling in MarkupSafe 2.1.5, which has an initialization issue with Python 3.14:
```
AttributeError: partially initialized module 'markupsafe' has no attribute 'Markup' (most likely due to a circular import)
SystemError: initialization of _speedups raised unreported exception
```

MarkupSafe 3.0.0+ includes proper Python 3.14 support and resolves this compatibility issue.

Reviewed By: q10

Differential Revision: D94448875

fbshipit-source-id: 8f0058de6701548ba7a0ed11cc37d57e40e09a59
Summary:
- Install specific CUDA components instead of full metapackage

Mirror of D94536031

Reviewed By: cthi

Differential Revision: D94548691

fbshipit-source-id: e24823b5d3f7ae4afe6114b85705c9f2a9c981cc
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2410

Pull Request resolved: pytorch#5434

The Triton TBE benchmark tool (triton_table_batched_embeddings_bench.py) previously accepted only a single value for --num-embeddings (E), --embedding-dim (D), and --bag-size (L), applying the same configuration uniformly across all tables. This made it impossible to benchmark realistic sharding plans where each table has different hash sizes, embedding dimensions, and pooling factors.

Per-table L support via Ls parameter in generate_requests (requests.py):

Added a new Ls: Optional[list[int]] parameter to generate_requests() that accepts explicit per-table bag sizes.
When Ls is provided, builds L_offsets using the same numpy operations as the existing sigma_L / generate_pooling_factors_from_stats path (np.concatenate, np.tile, np.insert(...cumsum(), 0, 0), torch.from_numpy), ensuring consistent behavior.
Mutually exclusive with sigma_L (asserts if both provided).
This avoids code duplication — the benchmark tool calls generate_requests(..., Ls=Ls) instead of maintaining a separate function.
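The construction could look roughly like this (a sketch under assumed shapes, not the exact requests.py code):

```python
import numpy as np

def build_L_offsets(Ls, B):
    """Per-table bag sizes Ls -> cumulative offsets for B bags per table."""
    # Repeat each table's L for its B bags, then cumsum with a leading 0,
    # mirroring the np.tile / cumsum / np.insert construction described above.
    lengths = np.concatenate([np.tile(np.array([L]), B) for L in Ls])
    return np.insert(lengths.cumsum(), 0, 0)

offsets = build_L_offsets(Ls=[2, 3], B=2)   # table 0: L=2, table 1: L=3
print(offsets.tolist())  # [0, 2, 4, 7, 10]
```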

# Repro: rank 65 in slow job
Stats pulled from Zoomer planner stats: https://www.internalfb.com/ai_infra/zoomer/mast_job/job_insights?tab=MAST_JOB_JOB_INSIGHTS&primarySubtab=Sharding%20Analysis&secondarySubtab=Sharding%20Analysis&jobName=aps-f1037855583-1037860081&jobVersion=0&jobAttempt=0

Using a CSV file to consolidate it.

Reviewed By: JChunX

Differential Revision: D94317840

fbshipit-source-id: 3dfb13d6e68b84c5a4aabc49b6c6a9eda66da402
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2402

Pull Request resolved: pytorch#5430

# TLDR;

TBE unit tests construct offsets via
```
torch.tensor([0] + np.cumsum(lengths).tolist()).long()
```
in which `torch.tensor()` infers `torch.float32`.
Float32 can only represent consecutive integers up to 2^24 (16,777,216); beyond that, odd values round to even, producing wrong offsets.
The subsequent `.long()` preserves the already-corrupted values.

So for large data, where offset values are expected to exceed 16,777,216, the offsets can be wrong. This also affects the backward_adagrad_large_dim test.

The diff:
- corrects the dtype so there is no precision loss due to conversion
- re-enables the backward_adagrad_large_dim test

----

# Problem

We see TBE forward output mismatch errors in unit tests when we set up large T (number of features) and large B (batch size). We can reproduce this error using the test `test_backward_adagrad_with_simple_tbe_op`
where
- T = 230, B = 294,440
- T = 57, B = 294,440 (just enough to cause error)
- T = 230, B = 184320

# Root cause
After debugging, the root cause turned out to be in the test's indices construction, not in the kernels.

- `np.ones() * L` produces a float64 numpy array
- `.tolist()` yields Python floats (float64)
- `torch.tensor()` infers `torch.float32` (the default dtype), narrowing 64-bit doubles to 32-bit floats.

Float32 can only represent consecutive integers up to 2^24 (16,777,216); beyond that, odd values round to even, producing wrong offsets. The subsequent `.long()` preserves the already-corrupted values.

This caused (T=230, B=184320, total_B=42M) to fail: the kernel received wrong offsets for b_t >= 2^24, computing wrong bag lengths (e.g., L=0 or L=2 instead of 1) and looking up wrong embedding rows. Both v1 and v2 TBE forward kernels were affected identically since they share the same offsets tensor.

Based on the example above:
```
Offsets for Table 56 (partially below 2^24, and partially above 2^24):
16486400.0 ... 16780796.0 16780798.0 16780800.0
                      ^^^^^^^^^ odd values missing (L becomes 2, as opposed to 1)

Offsets for Table 57 (fully above 2^24)
16780800.0 16780800.0 16780802.0  ...
           ^^^^^^^^^ should be 16780801, duplicated (L becomes 0, where 1 is expected)

Offsets for Table 113 (above 2^25 -- step size increases to 4):
33267200.0 33267200.0 33267202.0 33267204.0 33267204.0 33267204.0 33267206.0 33267208.0 33267208.0 33267208.0
33561592.0 33561592.0 33561592.0 33561592.0 33561592.0 33561596.0 33561596.0 33561596.0 33561600.0 33561600.0
                ^^^^^^^^^ 3 identical values (more L=0)

Offsets for Table 229 (near 2^26 -- step size is 4, massive gaps):
67417600.0 67417600.0 67417600.0 67417600.0 67417600.0 67417608.0 67417608.0 67417608.0 67417608.0 67417608.0
67711992.0 67711992.0 67711992.0 67711992.0 67711992.0 67711992.0 67712000.0 67712000.0 67712000.0 67712000.0
                ^^^^^^^^^ 5 identical values (more L=0)
```
# Fix

The fix ensures `np.ones` uses `int` and specifies `dtype=torch.int64` (or `dtype=torch.int32`) directly in `torch.tensor()`, eliminating the float32 conversion.

The fix applies to all offsets constructions and similar usages of np and torch.tensor.
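The narrowing itself is easy to reproduce without torch by round-tripping through IEEE-754 binary32, which is what float32 inference effectively does (the helper below is illustrative, not FBGEMM code):

```python
import struct

def as_f32(x: float) -> float:
    # Round-trip through IEEE-754 binary32, as a float32 tensor would store it.
    return struct.unpack("f", struct.pack("f", x))[0]

offsets = [16_777_215, 16_777_216, 16_777_217, 16_777_218]
narrowed = [int(as_f32(float(v))) for v in offsets]
print(narrowed)  # [16777215, 16777216, 16777216, 16777218]
# 16_777_217 (> 2^24) is not representable in float32 and rounds to 16_777_216,
# so consecutive bag lengths become 1, 0, 2 instead of 1, 1, 1.
```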

Reviewed By: gchalump, q10

Differential Revision: D94345333

fbshipit-source-id: 5085ab31f4e9ebb05d269c8708efff454655eaa8
Summary:

X-link: facebookresearch/FBGEMM#2404

Thread RES (Raw Embedding Streaming) parameters through the DRAM KV
embedding cache constructor chain and pybind to enable streaming
for the embedding cache enrichment path.
Currently, the feature is gated by `enable_raw_embedding_streaming`.
Key changes:

- Thread 6 RES params (DramKVEmbeddingCache -> wrapper -> pybind -> Python)
- Make raw_embedding_streamer_ protected for subclass access

Reviewed By: FriedCosey

Differential Revision: D94431329
@meta-codesync

meta-codesync bot commented Feb 28, 2026

This pull request has been merged in 0a963db.

@facebook-github-bot

This pull request has been reverted by 517378f.
