
Significant time spent on dense_to_jagged operations contradicts compute-intensive claims in paper #192

mia1460 opened this issue Feb 23, 2025 · 6 comments


mia1460 commented Feb 23, 2025

Issue Report: Significant time spent on dense_to_jagged operations contradicts compute-intensive claims in paper

Description

Hi maintainers,

When profiling the eval_metrics_v2_from_tensors module in the research code, I observed that the majority of the execution time is spent on dense_to_jagged memory-copy operations rather than on the compute-intensive operations described in the paper.

This seems to contradict the paper's emphasis on compute-bound workloads. Specifically:

  • Roughly 70-80% of the runtime is spent on tensor conversion/copy operations.
  • The actual metric computations occupy a smaller portion of the runtime.
  • Hardware utilization (GPU/TPU) seems lower than expected.

Steps to Reproduce

  1. Run profiling on eval_metrics_v2_from_tensors with the ml-20m dataset.
  2. Use the PyTorch profiler for performance analysis (a minimal sketch is shown after this list).
  3. Observe time distribution across operations.
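
A minimal sketch of the profiling setup (the evaluation call is a placeholder; the real signature of eval_metrics_v2_from_tensors lives in the research code):

```python
from torch.profiler import profile, ProfilerActivity

# Placeholder for the actual call to eval_metrics_v2_from_tensors on the
# ml-20m eval batches; the real signature lives in the research code.
def run_eval():
    ...

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    run_eval()

# Sort by CUDA time to check whether dense_to_jagged copies dominate.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```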

Expected Behavior

I expected the dominant time to be spent on:

  • Matrix factorization computations.
  • Attention operations.
  • Embedding interactions.

As suggested by the paper's computational complexity analysis.

Environment

  • PyTorch version: 2.2.1
  • Hardware: RTX A6000 48GB
  • Dataset scale: The same as ml-20m.
  • Batch size used: 128

Additional Context

Attached profiling trace screenshot:

[Image: Profiling Trace]

Questions

Would appreciate your insights on whether:

  1. This is a known optimization target.
  2. There are recommended configurations to mitigate this.
  3. The observation aligns with your internal profiling results.

Thank you for your work on this important research!

jiaqizhai (Contributor) commented:

Hi - we actually have two copies of code in this repo that we haven't fully integrated yet. Please check https://github.com/facebookresearch/generative-recommenders?tab=readme-ov-file#efficiency-experiments for the notes. Integration will likely be done in the next couple of weeks.


mia1460 commented Feb 25, 2025

Thank you for your previous reply. I have some further questions about the code and the model in your repository that I'm hoping you can help clarify.

1. Tensor Shape Transformation Issue

I've noticed that a significant amount of time is spent on tensor shape transformations, specifically between dense and jagged tensors. I'm curious about the rationale behind this design. Are there plans to optimize these conversions? They currently consume a considerable share of the runtime and may limit overall system efficiency.
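
For concreteness, here is a minimal sketch in plain PyTorch of the data movement that a dense-to-jagged conversion performs (the op in the trace is presumably a fused fbgemm/Triton kernel, so this only illustrates the copy cost, not the actual implementation):

```python
import torch

# Illustrative sketch of dense <-> jagged conversion using plain PyTorch.
# This only shows the data movement that appears as dense_to_jagged in the
# profile; the repo uses dedicated kernels for this.

def dense_to_jagged(dense: torch.Tensor, lengths: torch.Tensor):
    """dense: [B, N, D] padded batch; lengths: [B] valid lengths per row."""
    B, N, D = dense.shape
    mask = torch.arange(N, device=dense.device)[None, :] < lengths[:, None]  # [B, N]
    values = dense[mask]                                   # [sum(lengths), D]
    offsets = torch.zeros(B + 1, dtype=torch.long, device=dense.device)
    offsets[1:] = torch.cumsum(lengths, dim=0)             # row i lives in values[offsets[i]:offsets[i+1]]
    return values, offsets

def jagged_to_dense(values: torch.Tensor, offsets: torch.Tensor, max_len: int):
    B = offsets.numel() - 1
    D = values.size(-1)
    dense = values.new_zeros(B, max_len, D)
    for i in range(B):  # loop kept for clarity; real kernels do this in parallel
        row = values[offsets[i]:offsets[i + 1]]
        dense[i, : row.size(0)] = row
    return dense

lengths = torch.tensor([3, 1, 2])
dense = torch.randn(3, 4, 8)                 # padded to max length 4
values, offsets = dense_to_jagged(dense, lengths)
print(values.shape, offsets)                 # torch.Size([6, 8]) tensor([0, 3, 4, 6])
```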

2. User Sequence Inference Issue

Currently, in the existing tests, the model's input during each inference is based on the full user sequence. I'm wondering if the upcoming merged version will support inference using user incremental sequences. If so, could you provide some insights into how this will be implemented and what the potential benefits are?

I truly appreciate your time and effort in answering these questions. I'm looking forward to your response.

Best regards

jiaqizhai (Contributor) commented:

The general idea for what we described in the paper is to keep all intermediate tensors in jagged/ragged formats. We provide Triton/CUDA kernels in this repo that should enable you to do that. For the code on public datasets, for fairness with Transformer/SASRec-based baselines (where we didn't reimplement everything in jagged), we omitted this step.
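
As a toy illustration of keeping intermediates in jagged form (not the repo's Triton/CUDA HSTU kernels; just a readability sketch over a values/offsets layout like the one above):

```python
import torch
import torch.nn.functional as F

# Toy illustration only: attention is run per sequence directly on the jagged
# values/offsets layout, so no FLOPs are spent on padded positions and no
# dense<->jagged round trips are needed between layers.
def jagged_self_attention(values: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(values)
    for i in range(offsets.numel() - 1):
        x = values[offsets[i]:offsets[i + 1]]                     # [L_i, D]
        attn = F.softmax(x @ x.t() / x.size(-1) ** 0.5, dim=-1)   # [L_i, L_i]
        out[offsets[i]:offsets[i + 1]] = attn @ x
    return out

values = torch.randn(6, 8)                   # sum of lengths = 6, D = 8
offsets = torch.tensor([0, 3, 4, 6])         # three sequences of lengths 3, 1, 2
print(jagged_self_attention(values, offsets).shape)   # torch.Size([6, 8])
```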

Not quite sure about what you meant by "user incremental sequences" - is that microbatching/KV caching in M-FALCON?


mia1460 commented Feb 25, 2025

Currently, the input for encoding is the user's full sequence, which is used to generate the user's current embedding representation. If the user produces new behaviors, should the next input to the model be the user's current embedding representation along with the new behavior sequence, or the full user sequence including the new behaviors?

Moreover, will the dlrm-v3 code that is to be integrated keep all intermediate tensors in jagged/ragged formats?

@jiaqizhai
Copy link
Contributor

I do not fully understand this question. GR is not a typical embedding model (due to the target-aware formulation discussed in the paper), so you either need the full sequence or need to use KV caching to do partial re-computation.
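
For intuition, a minimal KV-caching sketch in plain PyTorch (illustrative only: this is not the repo's M-FALCON / delta_x_offsets code, it ignores GR's target-aware formulation, and causal masking within the new chunk is omitted for brevity):

```python
import torch
import torch.nn.functional as F

# Keys/values for the already-seen prefix are cached; only the newly appended
# items are projected and attended, instead of re-encoding the full sequence.
class TinyAttentionWithCache(torch.nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.q_proj = torch.nn.Linear(d, d)
        self.k_proj = torch.nn.Linear(d, d)
        self.v_proj = torch.nn.Linear(d, d)
        self.k_cache = None  # [T_prefix, d]
        self.v_cache = None

    def forward(self, new_x: torch.Tensor) -> torch.Tensor:
        """new_x: [T_new, d] embeddings of the newly added items only."""
        k_new, v_new = self.k_proj(new_x), self.v_proj(new_x)
        if self.k_cache is None:
            self.k_cache, self.v_cache = k_new, v_new
        else:
            self.k_cache = torch.cat([self.k_cache, k_new], dim=0)
            self.v_cache = torch.cat([self.v_cache, v_new], dim=0)
        q = self.q_proj(new_x)                                   # queries only for new items
        attn = F.softmax(q @ self.k_cache.t() / q.size(-1) ** 0.5, dim=-1)
        return attn @ self.v_cache                               # [T_new, d]

attn = TinyAttentionWithCache(d=16)
out1 = attn(torch.randn(5, 16))   # initial full sequence
out2 = attn(torch.randn(2, 16))   # later: only the 2 new behaviors are re-computed
```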

The jagged optimizations should be done in https://github.com/facebookresearch/generative-recommenders/blob/main/generative_recommenders/modules/stu.py (but that code is slightly WIP).


mia1460 commented Mar 2, 2025

Thank you for your explanation. I took another look at the code in the encode stage of the research section. Regarding your mention of "utilize KV caching to do partial re-computation", are you referring to the use of delta_x_offsets in the code? Even when using KV caching, does the model's input still require the full user sequence, rather than taking only the KV cache plus the newly added sequence as in LLMs? I'm not sure whether my understanding is correct. Thank you very much for your help.
