Thread raw embedding streamer to dram_kv_embedding_cache #5432
Closed
chouxi wants to merge 42 commits into pytorch:main
Conversation
Summary: Pull Request resolved: pytorch#5381 X-link: https://github.com/facebookresearch/FBGEMM/pull/2360 Add unit test that takes TBE configurations loaded from JSON files. This enables testing with production TBE configurations extracted from training logs. Changes: - Add `extract_specs_from_log.py` script to extract TBE configurations from training logs into a structured JSON format - Add helper functions to `test/tbe/common.py` for parsing config values: `parse_sparse_type`, `parse_pooling_mode`, `parse_embedding_location`, `parse_compute_device`, `parse_cache_algorithm`, `create_tbe_from_config`, `load_tbe_configs_from_file` - Add `test_backward_adagrad_from_config_file` test that loads TBE configs from JSON and runs backward adagrad validation - Add `input_tbe_specs.json` as a test resource with sample TBE configurations Reviewed By: q10, gchalump Differential Revision: D92390199 fbshipit-source-id: f25c9981fafdb1a6e2661bb5ce9b025d4da6995d
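The config-loading helper can be sketched as follows — a minimal, hypothetical stand-in for `load_tbe_configs_from_file` (the real JSON schema in `input_tbe_specs.json` is not shown in this summary, so the keys below are illustrative):

```python
import json
import tempfile

def load_tbe_configs_from_file(path):
    # Sketch only: the real helper in test/tbe/common.py presumably also
    # parses fields (sparse type, pooling mode, etc.); here we just load.
    with open(path) as f:
        return json.load(f)

# Hypothetical specs resembling what a training-log extractor might emit.
specs = [{"num_embeddings": 1000, "embedding_dim": 64, "pooling_mode": "SUM"}]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(specs, f)
    path = f.name

configs = load_tbe_configs_from_file(path)
assert configs == specs
```

This keeps the test resource human-editable: new production configs can be appended to the JSON file without touching test code.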
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2364 This PR fixes a bug in the floating-point specialized store function and adds a runtime check to skip the optimized kernel when `emb_t != grad_t`. Support for mixed precision is planned. Pull Request resolved: pytorch#5376 Reviewed By: cthi Differential Revision: D92419356 Pulled By: q10 fbshipit-source-id: 108ed99664464e3b21c477f555024fe159ef882e
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2363 Pull Request resolved: pytorch#5387 - Update OSS build script to include test packages setup Reviewed By: henrylhtsang Differential Revision: D92550346 fbshipit-source-id: 8ed1928505aba212b433eddf35da993f2d136b9c
Summary: Pull Request resolved: pytorch#5385 This PR uses std::tuple/std::pair for shorter code, and uses structured bindings in loops and emplace_back for clearer code and better performance. Pull Request resolved: pytorch#5384 Reviewed By: henrylhtsang Differential Revision: D92545407 Pulled By: q10 fbshipit-source-id: c08bfd7c0b9738aab31647edacbfb269de1c5af8
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2361 Pull Request resolved: pytorch#5382 - Migrate FBGEMM assert to FBGEMM_CHECK, pt 1 Reviewed By: sryap, cthi Differential Revision: D92443156 fbshipit-source-id: 8d8701d7f785e9d2f653eaf79c60fe600d392555
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2369 X-link: meta-pytorch/MSLK#140 - Update OSS build script to include test packages setup Reviewed By: cthi Differential Revision: D92733208 fbshipit-source-id: fbe1fb042a94da08f73eff1c7c37a776d7675b37
Summary: Pull Request resolved: pytorch#5394 This PR moves "-Wno-unused-command-line-argument" to clang to avoid build failures. Pull Request resolved: pytorch#5390 Reviewed By: cthi Differential Revision: D92645775 Pulled By: q10 fbshipit-source-id: cda310a0794cf1f76fc462cc0b9263be341ad721
Summary: Pull Request resolved: pytorch#5395 X-link: https://github.com/facebookresearch/FBGEMM/pull/2370 This PR applies more modernization fixes for C++ 17/20. Pull Request resolved: pytorch#5391 Reviewed By: cthi Differential Revision: D92645749 Pulled By: q10 fbshipit-source-id: 5914332bd6ea50c70c7b5d2cdd1f5ed0abd86bd1
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2372 Pull Request resolved: pytorch#5397 Four fixes to the backward_adagrad test: 1. **Fix bs_features construction**: The previous logic for building `bs_features` from `feature_table_map` was incorrect when tables were replicated across multiple features. It used manual insertion logic that was fragile and produced wrong mappings. Replaced with a direct list comprehension `[bs[t] for t in feature_table_map]` which correctly maps each feature to its corresponding table's embedding module. 2. **Fix OOM in gradcheck**: When `tbe_op` is provided, the test was proceeding into torch gradcheck, which allocates significant additional GPU memory, causing OOM. Added an early return when `tbe_op` is set, and explicitly freed large tensors (`bs`, `bs_features`, `fs`, `gos`, `cc`, etc.) before gradcheck to reduce GPU memory pressure. 3. **Add TBE_CONFIG_INDEX env var**: Added support for a `TBE_CONFIG_INDEX` environment variable to select a single TBE config to test, allowing targeted debugging of individual configurations without running all configs. 4. **Fix T_ calculation**: Changed `T_ = len(xws)` to `T_ = num_features` for the `feature_requires_grad` tensor size, since its length should match the number of features, not the number of input tensors. Reviewed By: q10 Differential Revision: D92668399 fbshipit-source-id: 0956adf01c2291fcba7b05fe39b70f669a408156
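The `bs_features` fix in point 1 is easiest to see with toy data — a sketch with hypothetical string stand-ins for the per-table embedding modules:

```python
# Toy stand-ins for per-table embedding modules, indexed by table id.
bs = ["emb_table_0", "emb_table_1", "emb_table_2"]

# feature_table_map[f] gives the table backing feature f; a table may be
# shared by several features (tables 0 and 2 here back two features each).
feature_table_map = [0, 0, 1, 2, 2]

# The fix: map each feature directly to its table's module.
bs_features = [bs[t] for t in feature_table_map]
assert bs_features == [
    "emb_table_0", "emb_table_0", "emb_table_1", "emb_table_2", "emb_table_2"
]
```

The comprehension is correct by construction for any replication pattern, which is what the fragile manual insertion logic failed to guarantee.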
Summary: Pull Request resolved: pytorch#5398 X-link: https://github.com/facebookresearch/FBGEMM/pull/2374 Original commit changeset: 5914332bd6ea Original Phabricator Diff: D92645749 Reviewed By: tonykao8080 Differential Revision: D92856605 fbshipit-source-id: 4c1c6ab1e30138e2749764c747a6be1c2e5b5028
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2368 Pull Request resolved: pytorch#5392 X-link: meta-pytorch/torchrec#3753 D88706297 Reviewed By: zlzhao1104 Differential Revision: D92738272 fbshipit-source-id: 57b9b66c7b0be5536e9a5526cc44fac216b94a1e
Summary: Pull Request resolved: pytorch#5399 X-link: https://github.com/facebookresearch/FBGEMM/pull/2375 The bucket assignment in get_bucket_id_from_ratio uses division: floor(ratio / stride) + 1. At boundary values, floating point division rounds down, placing blocks with ratio = threshold_bucket × stride into threshold_bucket. The eviction check in evict_block uses strict less-than: overall_ratio < threshold, where threshold = threshold_bucket × stride. Since ratio equals threshold (same floating point value), the comparison evaluates to false. These blocks are counted by the dry run (included in acc_count) but not evicted (fail the < check). Reviewed By: steven1327 Differential Revision: D92900276 fbshipit-source-id: 925ec04727db751990414c0c3d47a37536ca8210
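A small Python sketch of the two boundary effects described above (the constants are illustrative; the real stride and bucket values come from the eviction config):

```python
import math

# (1) Floating-point division can round DOWN at a bucket boundary:
# mathematically 0.6 / 0.2 == 3, but in binary doubles the quotient
# lands just below 3, so floor() assigns the block one bucket too low.
q = 0.6 / 0.2
assert q < 3.0
assert math.floor(q) == 2  # bucket id is off by one at the boundary

# (2) With threshold = threshold_bucket * stride, a block whose ratio is
# that very same floating-point value fails a strict less-than check,
# so it is counted by the dry run but never evicted.
stride, threshold_bucket = 0.2, 3
threshold = threshold_bucket * stride
ratio = threshold_bucket * stride   # same computation -> identical bits
assert not (ratio < threshold)      # strict '<' never fires at the boundary
assert ratio <= threshold           # an inclusive check would
```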
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2379 Pull Request resolved: pytorch#5403 - Upgrade glibc to 2.28 Reviewed By: cthi Differential Revision: D93138551 fbshipit-source-id: c80a14e739d1dfb138b7976edcc00e17f671b835
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2376 Pull Request resolved: pytorch#5400 - Scripts for building PyTorch Reviewed By: cthi Differential Revision: D92864314 fbshipit-source-id: 98db550101cdc396902079534a8c33372f141769
Summary: Pull Request resolved: pytorch#5404 X-link: https://github.com/facebookresearch/FBGEMM/pull/2380 - Fix ROCm benchmark workflow Reviewed By: henrylhtsang Differential Revision: D93157777 fbshipit-source-id: c243cfb97c3f8e8e96e86b23a2f48dadf73d45e8
Summary: Pull Request resolved: pytorch#5406 X-link: https://github.com/facebookresearch/FBGEMM/pull/2381 This PR implements an algorithmic optimization for the `group_index_select_or_add` kernel. The main focus is the `USE_VAR_COLS=true` cases. We noticed that the `member_id` search takes half of the kernel latency in some cases. There are two assumptions: 1. Since warp_id is in ascending order, we don't need to run the search on every iteration; it's more efficient to cache the member_id and upper_bound. 2. The binary upper-bound search implementation (which seems to be a copy-paste of the CPU version) is inefficient on GPU (and most likely on CPU as well for small/medium-size problems). With minor adjustments, these changes might be adapted for the CUDA path as well. Pull Request resolved: pytorch#5365

| Instantiation | Count | Mean (us) (Before) | Mean (us) (After) | % Improvement |
|---|---|---|---|---|
| group_index_select_or_add_2d_kernel<long, float, false, t... | 1920 | 18080.67 | 17002.10 | 5.98% |
| group_index_select_or_add_2d_kernel<long, float, false, f... | 960 | 9256.85 | 9750.73 | -5.35% |
| group_index_select_or_add_2d_kernel<long, float, true, tr... | 1920 | 3987.01 | 2225.99 | 44.18% |
| group_index_select_or_add_2d_kernel<long, float, true, fa... | 960 | 2618.33 | 2732.18 | -4.35% |
| index_select_scalar_cumsum_kernel<long, int, long, 256, 1... | 6100 | 7.19 | 7.12 | 0.97% |
| index_select_scalar_cumsum_kernel<int, int, int, 256, 102... | 1300 | 9.29 | 8.60 | 7.42% |
| TOTAL | 13160 | 4090.13 | 3720.07 | 9.04% |

https://docs.google.com/document/d/1dqJQhgsz_rVlvmrO0YsJ3ZpyahMW-cFoI6rDRWXjiwg/edit?tab=t.0#heading=h.5row1qfol66k

Test Plan:
```
cd ~/fbsource/fbcode/ai_codesign/nonprod/bensonma415/scripts/D92178105
# Run on MI300, MI350, A100, H100, GB200
bash run_benchmark.sh 2>&1 | pastry
```

```
# E2E Analysis
# Before Diff
buck run //ai_codesign/nonprod/bensonma415/scripts:download_and_analyze_traces -- \
  --url "https://www.internalfb.com/ai_infra/zoomer/mast_job/overview?jobName=aps-cmf_amd_echen40_20260212_154515-5fc47d5fee&jobVersion=0&jobAttempt=0" \
  -k index_select_scalar_cumsum_kernel \
  -k group_index_select_or_add_2d_kernel \
  --csv results.csv 2>&1 | pastry
> P2186151617: https://www.internalfb.com/intern/paste/P2186151617/

# After Diff
buck run //ai_codesign/nonprod/bensonma415/scripts:download_and_analyze_traces -- \
  --url "https://www.internalfb.com/ai_infra/zoomer/mast_job/overview?jobName=aps-cmf_amd_echen40_20260212_162821-e2bad24c93&jobVersion=0&jobAttempt=3" \
  -k index_select_scalar_cumsum_kernel \
  -k group_index_select_or_add_2d_kernel \
  --csv results.csv 2>&1 | pastry
> P2186174768: https://www.internalfb.com/intern/paste/P2186174768/
```

Reviewed By: spcyppt Differential Revision: D92178105 Pulled By: q10 fbshipit-source-id: af99d2e1f63c40b8dd6963735d3e9f4136ad1554
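Assumption 1 — caching the member id instead of re-running the upper-bound search — can be illustrated in plain Python (hypothetical group boundaries; the kernel does this per warp, in ascending warp_id order):

```python
import bisect

# Hypothetical prefix-sum of rows per group: group g owns rows
# [bounds[g], bounds[g+1]). Group 1 is empty here.
bounds = [0, 4, 4, 9, 12]

def member_ids_bisect(rows):
    # Baseline: an upper-bound binary search per row.
    return [bisect.bisect_right(bounds, r) - 1 for r in rows]

def member_ids_cached(rows):
    # Optimization sketch: rows arrive in non-decreasing order, so cache
    # the current member and only advance when its upper bound is crossed.
    out, member = [], 0
    for r in rows:
        while r >= bounds[member + 1]:
            member += 1
        out.append(member)
    return out

rows = [0, 1, 3, 4, 6, 8, 9, 11]
assert member_ids_cached(rows) == member_ids_bisect(rows) == [0, 0, 0, 2, 2, 2, 3, 3]
```

The cached scan does amortized O(1) work per row instead of O(log G), which is where the latency win in the `USE_VAR_COLS=true` rows above comes from.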
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2384 Pull Request resolved: pytorch#5408 - Fix kernel launcher test to work with CUDA_LAUNCH_BLOCKING=1 Reviewed By: sryap Differential Revision: D93292796 fbshipit-source-id: 726a991a3f9dd2af027ade4667edefd6bc8beb08
Summary: Pull Request resolved: pytorch#5405 # Context We are currently lacking CPU-GPU parity for `rowwise_adagrad_with_counter`, and this is evidenced by the `test_backward_optimizers_adagrad_with_counter_cpu` test failing if (1) it is enabled (currently skipped) and (2) if the new and ref weights are compared with tolerance 1.0e-4. This is because `rowwise_adagrad_with_counter` is missing various regularization and weight decay features. # The Fix In `rowwise_adagrad_with_counter`, I made the CPU weight update logic mirror that of the GPU. After running the `test_backward_optimizers_adagrad_with_counter_cpu` unit test, I have verified that it now passes, indicating correctness for the various tested configurations. Consequently, I was able to reduce the new vs ref weight tolerance to 1.0e-4, and verified that the rest of the tests in the file still pass. Reviewed By: spcyppt Differential Revision: D91080112 fbshipit-source-id: 6b5a0cf4a31c8b6e41f5cf7db5ea6206e634263d
Summary: Pull Request resolved: pytorch#5407 X-link: https://github.com/facebookresearch/FBGEMM/pull/2383 **Change to CUDA_KERNEL_ASSERT** ================================= CUDA_KERNEL_ASSERT2 returns when the runtime flag is not passed. https://docs.google.com/document/d/1m0-5ana08jBZZfgp9lmbM3W1qYVf6TNOoRoXJoOVKuw/edit?tab=t.0#heading=h.e58hjdsk3ro2 **Summary** ----------- This diff updates the `CUDA_KERNEL_ASSERT` statements in various files of the `fbgemm_gpu` library to provide more informative error messages. This change is made in five different files, each containing a separate update to the `CUDA_KERNEL_ASSERT` statement. **Files Affected** ------------------- * `sparse_pack_segments_forward.cu` * `embedding_bounds_check_v1.cu` * `gather_ranges_to_dense_v2.cu` * `embedding_bounds_check_v2.cu` * `sparse_index_select.cu` Reviewed By: q10 Differential Revision: D93033489 fbshipit-source-id: 60287d12f44df4709cc64bab9f8d156062e54d30
Summary: Pull Request resolved: pytorch#5189 X-link: https://github.com/facebookresearch/FBGEMM/pull/2187 populate_bucketized_permute is being used in request batching for seq embedding in RW. Currently, populate_bucketized_permute is parallelized with one thread per sequence, because: - This kernel is for seq embedding, where the order of the ids matters - We then need to calculate the offsets of the ids one by one, serially This diff improves the parallelism from thread to warp level using warp-level sync primitives (`__ballot_sync`/`__popc`). It assumes fewer than 32 shards (max warp size), which is predominantly true in serving. There are alternatives to this approach with pros and cons: we could do a prefix sum by world size to get the offset for each row and then parallelize per index. The pro is more parallelism; the con is more kernels (we need world_size - 1 prefix sums). Given that, at least for recsys models, the bottleneck is on GPU and that this approach increases parallelism 32-fold, I leaned toward this approach. # Rollback Plan The new warp-parallel kernel is gated behind the `BUCKETIZED_PERMUTE_WARP_KERNEL` feature gate: - **JustKnobs**: `fbgemm_gpu/features:BUCKETIZED_PERMUTE_WARP_KERNEL` — set to `false` to disable the warp-parallel kernel and fall back to the original per-thread kernel - **Environment variable**: `FBGEMM_BUCKETIZED_PERMUTE_WARP_KERNEL=0` (for OSS/testing) - When the feature gate is disabled, the code falls back to the original `_populate_bucketized_permute_cuda_kernel` with no behavior change - When `my_size > 32` (kWarpSize), the original kernel is always used regardless of the gate # Changes addressing reviewer feedback - Used `CUDA_KERNEL_ASSERT` with meaningful error messages - Added bounds check for `bucketized_offsets_data` access - Simplified `my_bucket` initialization with ternary - Inlined `pos` variable assignment - Added comment explaining the second `__ballot_sync` call - Added `BUCKETIZED_PERMUTE_WARP_KERNEL` feature gate (C++ enum + Python enum) - Added `config_cpp` dependency to `sparse_ops_gpu` BUCK target - Added comprehensive test `test_populate_bucketized_permute_warp_parallel_vs_original` comparing GPU warp-parallel kernel against CPU reference across 6 configurations - Set `FBGEMM_NO_JK=1` and `FBGEMM_BUCKETIZED_PERMUTE_WARP_KERNEL=1` env vars in test BUCK target Reviewed By: spcyppt, q10 Differential Revision: D88522379 fbshipit-source-id: 2bcc0e56a9293cf6b4ff575265a1fb4639bf18f0
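The warp-level position computation can be simulated in Python to see what `__ballot_sync` + `__popc` produce together: each lane's position within its bucket is the number of lower-numbered lanes holding the same bucket (toy data; lanes stand in for threads in a warp):

```python
# Bucket (shard) assignment per lane; assumes < 32 buckets as in the diff.
buckets = [2, 0, 2, 1, 0, 2]

def lane_positions(buckets):
    out = []
    for lane, b in enumerate(buckets):
        # "ballot": bitmask of lanes whose bucket equals this lane's bucket
        mask = sum(1 << i for i, x in enumerate(buckets) if x == b)
        # "popc" over lanes strictly below this one -> position in bucket
        out.append(bin(mask & ((1 << lane) - 1)).count("1"))
    return out

# Lanes 0, 2, 5 share bucket 2 and get positions 0, 1, 2, preserving order.
assert lane_positions(buckets) == [0, 0, 1, 0, 1, 2]
```

This preserves the original id ordering within each shard — the property the serial per-thread kernel was protecting — while letting all 32 lanes work at once.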
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2387 Pull Request resolved: pytorch#5412 - Update PyTorch OSS build scripts Reviewed By: henrylhtsang Differential Revision: D93157176 fbshipit-source-id: 9ca81450af86a39e9e5e845eee4e0fb58ae788eb
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2389 Pull Request resolved: pytorch#5414 Reviewed By: henrylhtsang Differential Revision: D93662599 Pulled By: q10 fbshipit-source-id: c1d7f5b8f343778bda774158340aa32cae9220c8
Summary: Pull Request resolved: pytorch#5251 X-link: https://github.com/facebookresearch/FBGEMM/pull/2219 This change improves the performance of tracking the deltas in TBE, mainly by replacing the DtoH copy with a DtoD copy plus an async DtoH under a stream_callback. To achieve this, the following is added: - a pre-registered UVA buffer that's accessible from both GPU and CPU is reused every iteration - its lifetime matches TBE's, which makes the async copy safe - the same buffer is reused to avoid repeated allocation - the CPU thread is triggered to async-copy in raw_embedding_streamer.stream(), so GPU ops don't wait on the D2H - to avoid buffer changes overlapping with the D2H copy, a GPU event tracks the finish of the D2D copy and the CPU thread waits for it - join_stream_tensor_copy_thread triggers a blocking wait for the copy in the next iteration, in case CPU copies take too long, before overwriting the pre-registered buffer. Reviewed By: q10 Differential Revision: D86888586 fbshipit-source-id: fe1c4bc7c2af28b201bafa6557617bef634b995c
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2390 Pull Request resolved: pytorch#5416 Reviewed By: cthi Differential Revision: D93762155 Pulled By: q10 fbshipit-source-id: 4b8b7f025b646579e397fa1c5fd9aced99d94eb0
Summary: Pull Request resolved: pytorch#5410 X-link: https://github.com/facebookresearch/FBGEMM/pull/2385 Stop using `constexpr` with `__builtin_constant_p`. It turns out that using `constexpr` results in the expression being immediately evaluated by the frontend before any inlining happens, always returning `false` in this context. However, this was meant to be evaluated AFTER the function got inlined: in the case of a fixed `count` after specialization, the compiler can fully promote the local array to a vector register, which is only possible when it can analyze the code with the known `memset` semantics. Reviewed By: Nicoshev Differential Revision: D93544122 fbshipit-source-id: 5970e4225e09c1433f7ad989276937e4cf93164f
Summary: Pull Request resolved: pytorch#5418 X-link: meta-pytorch/MSLK#158 X-link: https://github.com/facebookresearch/FBGEMM/pull/2392 When using CUTLASS grouped GEMM for wgrad, experts that receive zero tokens produce NaN weight gradients. The root cause is in bf16bf16bf16_grouped_wgrad.cu: when total_M > 0 (at least one expert has tokens), the output buffer was allocated with at::empty (uninitialized memory). The kernel's set_stacked_kernel_args_kernel skips experts where M_sizes[i] == 0, so those experts never write to their portion of the output, leaving NaN/garbage values. This is triggered consistently in RL training with 256 experts and small microbatches, where many experts receive zero tokens per microbatch. The fix changes at::empty to at::zeros so that zero-token experts get zero gradients instead of uninitialized memory. The cost of zeroing vs not is negligible relative to the GEMM computation. Also adds dedicated tests for wgrad with zero-token experts across FBGEMM, MSLK, and MSL test suites. The tests use a warmup-based cache poisoning technique — running the kernel once with all experts having tokens to fill the CUDA caching allocator with a non-zero G*N*K buffer, then freeing it so the allocator reuses that dirty block for the next at::empty call. This makes uninitialized-memory bugs deterministic and reproducible in tests, rather than relying on non-deterministic allocator behavior. Tests cover both total_M == 0 (all experts zero tokens) and total_M > 0 with partial zero-token experts. Reviewed By: cthi Differential Revision: D93777683 fbshipit-source-id: b0445ca0dcff94663a6ed10d2c1949e017dbd760
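The `at::empty` vs `at::zeros` distinction can be illustrated with a numpy analogue (toy sizes; `np.empty` is likewise uninitialized memory, so skipped rows keep whatever was in the allocation):

```python
import numpy as np

# M_sizes[i] = tokens routed to expert i; the kernel skips experts with
# zero tokens and never writes their slice of the output.
M_sizes = [3, 0, 2]
N = 4  # stand-in for the gradient width

# at::empty analogue: out = np.empty((len(M_sizes), N)) would leave the
# skipped expert's row as garbage (possibly NaN from a reused allocation).

# The fix's analogue of at::zeros: skipped experts get zero gradients.
out = np.zeros((len(M_sizes), N))
for i, m in enumerate(M_sizes):
    if m == 0:
        continue          # zero-token expert: no write happens
    out[i] = 1.0          # stand-in for the real GEMM result

assert np.all(out[1] == 0.0)  # the zero-token expert's row is clean zeros
```

The warmup-based cache-poisoning trick in the tests plays the role of the "reused allocation" here: it guarantees the `empty` buffer is dirty, making the bug deterministic.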
Summary: Pull Request resolved: pytorch#5417 X-link: https://github.com/facebookresearch/FBGEMM/pull/2391 - Fix undefined symbol error in OSS that was introduced with D88522379 (PR pytorch#5189) Reviewed By: cthi Differential Revision: D93773790 fbshipit-source-id: dca30fc82c34d36c4a6f73e21c5e666ff777d75f
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2394 Pull Request resolved: pytorch#5420 - Disable building asmjit and fbgemm when building genai module Reviewed By: jiawenliu64 Differential Revision: D93832742 fbshipit-source-id: e317137d2399c437a01830cc779ca65e07d616d6
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2395 - Update FBGEMM OSS build scripts Reviewed By: cthi Differential Revision: D93571865 fbshipit-source-id: 2db91699221195d0213e28e24fa6c971a409b0c1
Summary: Pull Request resolved: pytorch#5424 X-link: https://github.com/facebookresearch/FBGEMM/pull/2397 There's a bug in TBE v2 kernel. Skip the test for now to unblock OSS. Reviewed By: q10 Differential Revision: D93905146 fbshipit-source-id: afaa8b5b4d98e5c361a5f391dafc5ea5a2bdf5ee
Summary: Pull Request resolved: pytorch#5423 X-link: https://github.com/facebookresearch/FBGEMM/pull/2396 Original commit changeset: fe1c4bc7c2af Original Phabricator Diff: D86888586 The culprit diff pulls the CUDA deps into the raw_embedding_streamer target, which was exported via the split_table_batched_embeddings_cpu target and is used by quite a few CPU-only builds. We should handle this properly, so first back out the diff to mitigate. Reviewed By: q10 Differential Revision: D93902471 fbshipit-source-id: 7e36e7d7f8f78ebd3d3e062ec5fe2861021fd57c
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2399 Pull Request resolved: pytorch#5425 - Skip tests if GPU memory is insufficient Reviewed By: henrylhtsang Differential Revision: D93970398 fbshipit-source-id: afea4a172817e614ea8c68f9528f43618f072312
Summary: Pull Request resolved: pytorch#5426 X-link: https://github.com/facebookresearch/FBGEMM/pull/2398 - Trigger pytorch builds on exact SHA Reviewed By: henrylhtsang Differential Revision: D93541541 fbshipit-source-id: b1dae6396833c307d937724d623281fba2856f40
Summary: Pull Request resolved: pytorch#5427 X-link: https://github.com/facebookresearch/FBGEMM/pull/2400 - Fix ROCm Nova builds Reviewed By: cthi Differential Revision: D94104910 fbshipit-source-id: 832063400cff5665cec900105a8be4fce8854a83
Summary: Pull Request resolved: pytorch#5428 X-link: https://github.com/facebookresearch/FBGEMM/pull/2401 - Fix build warnings Reviewed By: cthi Differential Revision: D94113092 fbshipit-source-id: 936892e39b62b85c6da76dda67c187715c519baa
Summary: Pull Request resolved: pytorch#5431 X-link: https://github.com/facebookresearch/FBGEMM/pull/2403 The PyPI binary size limit was being exceeded in nightly builds. This change removes 12.0a from nightly to maintain compliance with the PyPI binary size limit. Previously, release builds excluded 12.0a while nightly builds included it. This inconsistency caused nightly builds to exceed the size limit. Now both build types use the same architecture list. Reviewed By: spcyppt Differential Revision: D94406173 fbshipit-source-id: bbc1d382dd16494377d1b8e9586f6cb8d3af2195
Summary: Pull Request resolved: pytorch#5435 X-link: https://github.com/facebookresearch/FBGEMM/pull/2406 as title Reviewed By: sryap, q10 Differential Revision: D94480320 fbshipit-source-id: 00abd2d17fd818dfd575bd0fa16fcb737bc425e6
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2405 Pull Request resolved: pytorch#5433 Fix CI failure on Python 3.14 caused by MarkupSafe 2.1.5 circular import error. The jinja2 dependency was pulling in MarkupSafe 2.1.5, which has an initialization issue with Python 3.14: ``` AttributeError: partially initialized module 'markupsafe' has no attribute 'Markup' (most likely due to a circular import) SystemError: initialization of _speedups raised unreported exception ``` MarkupSafe 3.0.0+ includes proper Python 3.14 support and resolves this compatibility issue. Reviewed By: q10 Differential Revision: D94448875 fbshipit-source-id: 8f0058de6701548ba7a0ed11cc37d57e40e09a59
Summary: - Install specific CUDA components instead of full metapackage Mirror of D94536031 Reviewed By: cthi Differential Revision: D94548691 fbshipit-source-id: e24823b5d3f7ae4afe6114b85705c9f2a9c981cc
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2410 Pull Request resolved: pytorch#5434 The Triton TBE benchmark tool (triton_table_batched_embeddings_bench.py) previously accepted only a single value for --num-embeddings (E), --embedding-dim (D), and --bag-size (L), applying the same configuration uniformly across all tables. This made it impossible to benchmark realistic sharding plans where each table has different hash sizes, embedding dimensions, and pooling factors. Per-table L support via Ls parameter in generate_requests (requests.py): Added a new Ls: Optional[list[int]] parameter to generate_requests() that accepts explicit per-table bag sizes. When Ls is provided, builds L_offsets using the same numpy operations as the existing sigma_L / generate_pooling_factors_from_stats path (np.concatenate, np.tile, np.insert(...cumsum(), 0, 0), torch.from_numpy), ensuring consistent behavior. Mutually exclusive with sigma_L (asserts if both provided). This avoids code duplication — the benchmark tool calls generate_requests(..., Ls=Ls) instead of maintaining a separate function. # Repro Rank 65 in slow job. Stats pulled from Zoomer planner stats: https://www.internalfb.com/ai_infra/zoomer/mast_job/job_insights?tab=MAST_JOB_JOB_INSIGHTS&primarySubtab=Sharding%20Analysis&secondarySubtab=Sharding%20Analysis&jobName=aps-f1037855583-1037860081&jobVersion=0&jobAttempt=0 Using a CSV file to consolidate it. Reviewed By: JChunX Differential Revision: D94317840 fbshipit-source-id: 3dfb13d6e68b84c5a4aabc49b6c6a9eda66da402
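The per-table `Ls` offsets construction described above can be sketched with numpy (toy values; the real path also converts the result with `torch.from_numpy`):

```python
import numpy as np

# Hypothetical per-table bag sizes (pooling factors) and batch size.
Ls = [2, 5, 1]  # one L per table
B = 4           # bags per table in the request

# Each table contributes B bags of its own length; L_offsets is the
# exclusive prefix sum over all bags, with a leading 0 inserted.
lengths = np.concatenate([np.tile(L, B) for L in Ls])
L_offsets = np.insert(lengths.cumsum(), 0, 0)

assert list(lengths) == [2, 2, 2, 2, 5, 5, 5, 5, 1, 1, 1, 1]
assert L_offsets[0] == 0 and L_offsets[-1] == sum(Ls) * B  # 32 total indices
```

Because this reuses the same `np.tile`/`cumsum`/`np.insert` sequence as the sigma_L path, the per-table mode cannot drift from the existing statistical mode.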
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2402 Pull Request resolved: pytorch#5430 # TLDR; TBE unit tests construct offsets via ``` torch.tensor([0] + np.cumsum(lengths).tolist()).long() ``` in which `torch.tensor()` infers `torch.float32`. Float32 can only represent consecutive integers up to 2^24 (16,777,216); beyond that, odd values round to even, producing wrong offsets. The subsequent `.long()` preserves the already-corrupted values. So for large data, where offset values exceed 16,777,216, the offsets come out wrong. This also affects the backward_adagrad_large dim test. The diff - corrects the dtype so that there's no precision loss due to conversion. - re-enables the backward_adagrad_large_dim test. ---- # Problem We see TBE forward output mismatch errors in unit tests when we set up large T (number of features) and large B (batch size). We can reproduce this error using the test `test_backward_adagrad_with_simple_tbe_op` where - T = 230, B = 294,440 - T = 57, B = 294,440 (just enough to cause error) - T = 230, B = 184320 # Root cause After debugging, the root cause was in the test's indices construction, not in the kernels. - `np.ones() * L` produces a float64 numpy array - `.tolist()` yields Python floats (float64) - `torch.tensor()` infers `torch.float32` (the default dtype) — narrowing 64-bit doubles to 32-bit floats. Float32 can only represent consecutive integers up to 2^24 (16,777,216); beyond that, odd values round to even, producing wrong offsets. The subsequent `.long()` preserves the already-corrupted values. This caused (T=230, B=184320, total_B=42M) to fail: the kernel received wrong offsets for b_t >= 2^24, computing wrong bag lengths (e.g., L=0 or L=2 instead of 1) and looking up wrong embedding rows. Both v1 and v2 TBE forward kernels were affected identically since they share the same offsets tensor.
Based on the example above:
```
Offsets for Table 56 (partially below 2^24, and partially above 2^24):
16486400.0 ... 16780796.0 16780798.0 16780800.0
^^^ odd values missing (L becomes 2, as opposed to 1)

Offsets for Table 57 (fully above 2^24):
16780800.0 16780800.0 16780802.0 ...
^^^ should be 16780801, duplicated (L becomes 0, where 1 is expected)

Offsets for Table 113 (above 2^25 -- step size increases to 4):
33267200.0 33267200.0 33267202.0 33267204.0 33267204.0 33267204.0 33267206.0 33267208.0 33267208.0 33267208.0
33561592.0 33561592.0 33561592.0 33561592.0 33561592.0 33561596.0 33561596.0 33561596.0 33561600.0 33561600.0
^^^ 3 identical values (more L=0)

Offsets for Table 229 (near 2^26 -- step size is 4, massive gaps):
67417600.0 67417600.0 67417600.0 67417600.0 67417600.0 67417608.0 67417608.0 67417608.0 67417608.0 67417608.0
67711992.0 67711992.0 67711992.0 67711992.0 67711992.0 67711992.0 67712000.0 67712000.0 67712000.0 67712000.0
^^^ 5 identical values (more L=0)
```
# Fix The fix ensures `np.ones` uses `int` and specifies `dtype=torch.int64` (or `dtype=torch.int32`) directly in `torch.tensor()`, eliminating the float32 conversion. The fix applies to all offsets constructions and similar uses of np and torch.tensor. Reviewed By: gchalump, q10 Differential Revision: D94345333 fbshipit-source-id: 5085ab31f4e9ebb05d269c8708efff454655eaa8
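The 2^24 effect is easy to reproduce with numpy — an analogue of the bug, since the test's corruption came from `torch.tensor`'s default float32 inference, which narrows the same way:

```python
import numpy as np

# float32 represents consecutive integers only up to 2^24 = 16_777_216.
assert np.float32(16_777_216) == 16_777_216  # still exact
assert np.float32(16_777_217) == 16_777_216  # odd value rounds to even

# Reproduce the corruption on a slice of offsets straddling the boundary,
# mimicking a float32 round-trip before the cast back to int64 (.long()).
true_offsets = np.arange(16_777_214, 16_777_222, dtype=np.int64)
corrupted = true_offsets.astype(np.float32).astype(np.int64)
bag_lengths = np.diff(corrupted)
assert 0 in bag_lengths and 2 in bag_lengths  # L=0 / L=2 instead of 1

# The fix's principle: keep an integer dtype end to end -> no loss.
assert (np.diff(true_offsets) == 1).all()
```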
Summary: X-link: facebookresearch/FBGEMM#2404 Thread RES (Raw Embedding Streaming) parameters through the DRAM KV embedding cache constructor chain and pybind to enable streaming for the embedding cache enrichment path. Currently the feature is gated by `enable_raw_embedding_streaming`. Key changes: - Thread 6 RES params (DramKVEmbeddingCache -> wrapper -> pybind -> Python) - Make raw_embedding_streamer_ protected for subclass access Reviewed By: FriedCosey Differential Revision: D94431329
This pull request has been merged in 0a963db.
This pull request has been reverted by 517378f.