
Significant time spent on dense_to_jagged operations contradicts compute-intensive claims in paper #192

mia1460 opened this issue Feb 23, 2025 · 6 comments


mia1460 commented Feb 23, 2025

Issue Report: Significant time spent on dense_to_jagged operations contradicts compute-intensive claims in paper

Description

Hi maintainers,

When profiling the eval_metrics_v2_from_tensors module in the research code, I observed that the majority of the execution time is spent on dense_to_jagged memory-copy operations rather than on the compute-intensive operations described in the paper.

This seems to contradict the paper's emphasis on compute-bound workloads. Specifically:

  • Roughly 70-80% of the runtime is spent on tensor conversion/copy operations.
  • The actual metric computations occupy a smaller portion of the runtime.
  • Hardware utilization (GPU/TPU) seems lower than expected.

Steps to Reproduce

  1. Run profiling on eval_metrics_v2_from_tensors with the ml-20m dataset.
  2. Use the PyTorch profiler for performance analysis (a minimal sketch is shown after this list).
  3. Observe time distribution across operations.
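
A minimal sketch of the profiling setup (the evaluation call is a placeholder; the real signature of eval_metrics_v2_from_tensors lives in the research code):

```python
from torch.profiler import profile, ProfilerActivity

# Placeholder for the actual call to eval_metrics_v2_from_tensors on the
# ml-20m eval batches; the real signature lives in the research code.
def run_eval():
    ...

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    run_eval()

# Sort by CUDA time to check whether dense_to_jagged copies dominate.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```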

Expected Behavior

I expected the dominant time to be spent on:

  • Matrix factorization computations.
  • Attention operations.
  • Embedding interactions.

As suggested by the paper's computational complexity analysis.

Environment

  • PyTorch version: 2.2.1
  • Hardware: RTX A6000 48GB
  • Dataset scale: The same as ml-20m.
  • Batch size used: 128

Additional Context

Attached profiling trace screenshot:

[Image: Profiling Trace]

Questions

Would appreciate your insights on whether:

  1. This is a known optimization target.
  2. There are recommended configurations to mitigate this.
  3. The observation aligns with your internal profiling results.

Thank you for your work on this important research!

jiaqizhai (Contributor) commented:

Hi - we actually have two copies of code in this repo that we haven't fully integrated yet. Please check https://github.com/facebookresearch/generative-recommenders?tab=readme-ov-file#efficiency-experiments for the notes. Integration will likely be done in the next couple of weeks.


mia1460 commented Feb 25, 2025

Thank you for your previous reply. I have some further questions about the code and the model in your repository that I'm hoping you can help clarify.

1. Tensor Shape Transformation Issue

I've noticed that a significant amount of time is spent on tensor shape transformations, specifically between dense and jagged tensors. I'm curious about the rationale behind this design. Are there plans to optimize these conversions? They currently consume a considerable share of the runtime and may limit overall system efficiency.
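
For concreteness, here is a minimal sketch in plain PyTorch of the data movement that a dense-to-jagged conversion performs (the op in the trace is presumably a fused fbgemm/Triton kernel, so this only illustrates the copy cost, not the actual implementation):

```python
import torch

# Illustrative sketch of dense <-> jagged conversion using plain PyTorch.
# This only shows the data movement that appears as dense_to_jagged in the
# profile; the repo uses dedicated kernels for this.

def dense_to_jagged(dense: torch.Tensor, lengths: torch.Tensor):
    """dense: [B, N, D] padded batch; lengths: [B] valid lengths per row."""
    B, N, D = dense.shape
    mask = torch.arange(N, device=dense.device)[None, :] < lengths[:, None]  # [B, N]
    values = dense[mask]                                   # [sum(lengths), D]
    offsets = torch.zeros(B + 1, dtype=torch.long, device=dense.device)
    offsets[1:] = torch.cumsum(lengths, dim=0)             # row i lives in values[offsets[i]:offsets[i+1]]
    return values, offsets

def jagged_to_dense(values: torch.Tensor, offsets: torch.Tensor, max_len: int):
    B = offsets.numel() - 1
    D = values.size(-1)
    dense = values.new_zeros(B, max_len, D)
    for i in range(B):  # loop kept for clarity; real kernels do this in parallel
        row = values[offsets[i]:offsets[i + 1]]
        dense[i, : row.size(0)] = row
    return dense

lengths = torch.tensor([3, 1, 2])
dense = torch.randn(3, 4, 8)                 # padded to max length 4
values, offsets = dense_to_jagged(dense, lengths)
print(values.shape, offsets)                 # torch.Size([6, 8]) tensor([0, 3, 4, 6])
```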

2. User Sequence Inference Issue

Currently, in the existing tests, the model's input during each inference is based on the full user sequence. I'm wondering if the upcoming merged version will support inference using user incremental sequences. If so, could you provide some insights into how this will be implemented and what the potential benefits are?

I truly appreciate your time and effort in answering these questions. I'm looking forward to your response.

Best regards

jiaqizhai (Contributor) commented:

The general idea for what we described in the paper is to keep all intermediate tensors in jagged/ragged formats. We provide Triton/CUDA kernels in this repo that should enable you to do that. For the code on public datasets, for fairness with Transformer/SASRec-based baselines (where we didn't reimplement everything in jagged), we omitted this step.
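
As a toy illustration of keeping intermediates in jagged form (not the repo's Triton/CUDA HSTU kernels; just a readability sketch over a values/offsets layout like the one above):

```python
import torch
import torch.nn.functional as F

# Toy illustration only: attention is run per sequence directly on the jagged
# values/offsets layout, so no FLOPs are spent on padded positions and no
# dense<->jagged round trips are needed between layers.
def jagged_self_attention(values: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(values)
    for i in range(offsets.numel() - 1):
        x = values[offsets[i]:offsets[i + 1]]                     # [L_i, D]
        attn = F.softmax(x @ x.t() / x.size(-1) ** 0.5, dim=-1)   # [L_i, L_i]
        out[offsets[i]:offsets[i + 1]] = attn @ x
    return out

values = torch.randn(6, 8)                   # sum of lengths = 6, D = 8
offsets = torch.tensor([0, 3, 4, 6])         # three sequences of lengths 3, 1, 2
print(jagged_self_attention(values, offsets).shape)   # torch.Size([6, 8])
```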

Not quite sure about what you meant by "user incremental sequences" - is that microbatching/KV caching in M-FALCON?


mia1460 commented Feb 25, 2025

Currently, the input for encoding is the user's full sequence, which is used to generate the user's current embedding representation. If the user produces new behaviors, should the next input to the model be the user's current embedding representation along with the new behavior sequence, or the full user sequence including the new behaviors?

Moreover, will the dlrm-v3 code that is to be integrated keep all intermediate tensors in jagged/ragged formats?

@jiaqizhai
Copy link
Contributor

I do not fully understand this question. GR is not a typical embedding model (due to the target-aware formulation discussed in the paper), so you either need the full sequence or need to use KV caching to do partial re-computation.
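
For intuition, a minimal KV-caching sketch in plain PyTorch (illustrative only: this is not the repo's M-FALCON / delta_x_offsets code, it ignores GR's target-aware formulation, and causal masking within the new chunk is omitted for brevity):

```python
import torch
import torch.nn.functional as F

# Keys/values for the already-seen prefix are cached; only the newly appended
# items are projected and attended, instead of re-encoding the full sequence.
class TinyAttentionWithCache(torch.nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.q_proj = torch.nn.Linear(d, d)
        self.k_proj = torch.nn.Linear(d, d)
        self.v_proj = torch.nn.Linear(d, d)
        self.k_cache = None  # [T_prefix, d]
        self.v_cache = None

    def forward(self, new_x: torch.Tensor) -> torch.Tensor:
        """new_x: [T_new, d] embeddings of the newly added items only."""
        k_new, v_new = self.k_proj(new_x), self.v_proj(new_x)
        if self.k_cache is None:
            self.k_cache, self.v_cache = k_new, v_new
        else:
            self.k_cache = torch.cat([self.k_cache, k_new], dim=0)
            self.v_cache = torch.cat([self.v_cache, v_new], dim=0)
        q = self.q_proj(new_x)                                   # queries only for new items
        attn = F.softmax(q @ self.k_cache.t() / q.size(-1) ** 0.5, dim=-1)
        return attn @ self.v_cache                               # [T_new, d]

attn = TinyAttentionWithCache(d=16)
out1 = attn(torch.randn(5, 16))   # initial full sequence
out2 = attn(torch.randn(2, 16))   # later: only the 2 new behaviors are re-computed
```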

The jagged optimizations should be done in https://github.com/facebookresearch/generative-recommenders/blob/main/generative_recommenders/modules/stu.py (but that code is slightly WIP).


mia1460 commented Mar 2, 2025

Thank you for your explanation. I took another look at the code in the encode stage of the research section. Regarding your mention of "utilize KV caching to do partial re-computation", are you referring to the use of delta_x_offsets in the code? Even when using KV caching, does the model's input still require the full user sequence, rather than taking only the KV cache plus the newly added sequence as in LLMs? I'm not sure whether my understanding is correct. Thank you very much for your help.
