
[ENH] Precompute data to massively accelerate training in GPU #1850


Open · wants to merge 17 commits into base: main

Conversation


@jobs-git jobs-git commented May 27, 2025

Proof of concept

https://drive.google.com/file/d/1go7wEMqcejRl7YecHYhIFAhO6YF3T0Kd/view?usp=sharing

Results on Training

with pre-compute: 53.662221 s
vanilla: 265.779447 s

that's roughly a 5x speedup (about 490% of the vanilla throughput), far from 2000%, but with this update developers can reduce cost and hardware requirements by roughly 4x!

Results on Hyperparameter Optimization

with pre-compute: 41.021749 s
vanilla: 209.937233 s

that's roughly a 5x speedup (about 510% of the vanilla throughput).

Description

Fixes: #1849 #1426 #1860
Partially Fixes: sktime/sktime#8278
Closes: #1846

Supersedes: #806

The primary issue here is that `__getitem__()` performs pre-processing that would typically be done before training actually starts.

As a result, the GPU is frequently idle, which slows down training completion.

It is well known that a GPU achieves higher throughput the busier it can be kept, but the pre-processing done in `__getitem__()` every time an item is retrieved severely limits this.

This commit ensures that pre-processing is performed prior to training. This is achieved as follows:

  1. setting `TimeSeriesDataset(..., precompute=True)` activates the call to the `__precompute__()` function inside `to_dataloader()`
  2. the `__precompute__()` function collects the pre-computed items from `__item_tensor__()` and stores them in a cache; it relies on the sampler to retrieve the data indices
  3. `__item_tensor__()` is the unmodified algorithm of the original `__getitem__()`, ensuring equivalent outcomes
  4. the new `__getitem__()` retrieves items from the cache in order, because the sampler has already been consumed once by `__precompute__()`, so relying on `idx` could yield a different index sequence

Note: TimeSeriesDataset defaults to the slow path, so the vanilla method can still be used (a minimal usage sketch follows below).
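A minimal usage sketch of the proposed flag; the dataset columns and lengths below are made up for illustration, and `precompute` is the new argument introduced by this PR, not part of the released API:

```python
import pandas as pd

from pytorch_forecasting import TimeSeriesDataSet

# toy data: one series with 100 time steps (illustrative only)
data = pd.DataFrame(
    {
        "time_idx": list(range(100)),
        "value": [float(i) for i in range(100)],
        "group": ["a"] * 100,
    }
)

dataset = TimeSeriesDataSet(
    data,
    time_idx="time_idx",
    target="value",
    group_ids=["group"],
    max_encoder_length=24,
    max_prediction_length=6,
    time_varying_unknown_reals=["value"],
    precompute=True,  # proposed flag; defaults to False (the original slow path)
)

# to_dataloader() triggers __precompute__() and fills the cache before training;
# batch_sampler and shuffle are overridden because TimeSynchronizedBatchSampler is forced
train_dataloader = dataset.to_dataloader(train=True, batch_size=64)
```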

### WIP help:
~~@fkiraly ~~
- [ ] Not sure where to get the batch indices if sampler=None, so I had to force the use of TimeSynchronizedBatchSampler. Guidance on this will be appreciated.

Resolution: when precompute=True in TimeSeriesDataset, TimeSynchronizedBatchSampler is used regardless of the sampler set in to_dataloader.

Caveats and Limitations

  • This feature stores the precomputed data in RAM, so enough RAM must be available, otherwise an out-of-memory error will occur (a rough sizing sketch follows after this list).

    Some form of super-batching could be used if precompute must still be used (though it is unclear how much that would help); however, that would be beyond the scope of this PR.

    Recommendation: use precompute=False to fall back to the original slow path.

  • For ultra-large-scale datasets, FSDP may be utilized via pytorch-lightning distributed computing, but that is also beyond the scope of this PR.

    Recommendation: use precompute=False to fall back to the original slow path.

  • Using precompute=True ignores two settings, batch_sampler and shuffle, since it sets the sampler to TimeSynchronizedBatchSampler and shuffle to False.

    Recommendation: use precompute=False to keep the original slow path in case you would like to set those two settings.

    Could this be implemented? Yes, of course, but it is beyond the scope of this PR. It would also make TimeSeriesDataset more complex to use when precompute=True is enabled, and would increase memory requirements.
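As a rough aid for the first caveat, a back-of-the-envelope sizing check; every number below is a placeholder rather than a measurement from this PR:

```python
# rough RAM estimate for the precompute cache (all numbers are placeholders)
n_samples = 500_000                    # number of (encoder, decoder) windows in the dataset
encoder_length, prediction_length = 24, 6
n_features = 20                        # real/categorical columns per time step
bytes_per_value = 4                    # float32

bytes_per_sample = (encoder_length + prediction_length) * n_features * bytes_per_value
total_gb = n_samples * bytes_per_sample / 1e9
print(f"approximate cache size: {total_gb:.1f} GB")  # ~1.2 GB for these numbers
```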

Checklist

  • Linked issues (if existing)
  • Amended changelog for large changes (and added myself there as contributor)
  • Added/modified tests
  • Used pre-commit hooks when committing to ensure that code is compliant with hooks. Install hooks with pre-commit install.
    To run hooks independent of commit, execute pre-commit run --all-files

@fkiraly fkiraly added the enhancement New feature or request label May 27, 2025
Collaborator

@fkiraly fkiraly left a comment

Nice!

FYI @phoeenniixx, @PranavBhatP, @xandie985 - is this something we need to take into account in v2? We should also think about adding profiling checks.

Code quality (linting) is failing; this can be fixed by using pre-commit or automated formatting. pytorch-forecasting does not have this in the docs (we should add it), but it is the same as in sktime:
https://www.sktime.net/en/stable/developer_guide/coding_standards.html

@fkiraly
Collaborator

fkiraly commented May 28, 2025

FYI @agobbifbk

@agobbifbk

Agreed, most of the time you can fit your data in memory, and we should include the precomputation possibility in the d2 layer. We should already have the correct indexes computed; it is just a matter of creating the tensors according to those indexes.

When testing tuning with pytorch-lightning, it is not able to perform hyperparameter optimization and reports: Failed to compute suggestion for learning rate because there are not enough points. Increase the loop iteration limits or the size of your dataset/dataloader.

When you say hyperparameter optimization, do you mean the learning rate using trainer.tune, or do you really mean the hyperparameters of the model?

@jobs-git
Author

> Agreed, most of the time you can fit your data in memory, and we should include the precomputation possibility in the d2 layer. We should already have the correct indexes computed; it is just a matter of creating the tensors according to those indexes.

> When testing tuning with pytorch-lightning, it is not able to perform hyperparameter optimization and reports: Failed to compute suggestion for learning rate because there are not enough points. Increase the loop iteration limits or the size of your dataset/dataloader.

> When you say hyperparameter optimization, do you mean the learning rate using trainer.tune, or do you really mean the hyperparameters of the model?

This is not an issue anymore; I had missed removing a fit call (used during my testing) that I placed before the Tuner.

@phoeenniixx
Contributor

phoeenniixx commented May 28, 2025

Hi @jobs-git, I see you are still facing some linting issues. May I suggest setting up pre-commit in your local repo? The process is similar to sktime, as mentioned by @fkiraly:
https://www.sktime.net/en/stable/developer_guide/coding_standards.html#setting-up-local-code-quality-checks

I would suggest setting up pre-commit locally and then checking whether your code quality fails using

pre-commit run --files <path-to-your-files>

This will reduce your effort considerably (it automatically fixes some issues, so you do not have to make those changes yourself).

@phoeenniixx
Contributor

This way you do not need to wait here and make changes based on the errors shown in the code-quality CI workflows :)


codecov bot commented May 28, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (main@01f97c4). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1850   +/-   ##
=======================================
  Coverage        ?   87.42%           
=======================================
  Files           ?       68           
  Lines           ?     6618           
  Branches        ?        0           
=======================================
  Hits            ?     5786           
  Misses          ?      832           
  Partials        ?        0           
Flag Coverage Δ
cpu 87.42% <100.00%> (?)
pytest 87.42% <100.00%> (?)


@jobs-git jobs-git changed the title from "[WIP] [BUG] fix precompute data to massively accelerate training in GPU" to "[WIP] [ENH] fix precompute data to massively accelerate training in GPU" May 30, 2025
@jobs-git jobs-git changed the title from "[WIP] [ENH] fix precompute data to massively accelerate training in GPU" to "[ENH] fix precompute data to massively accelerate training in GPU" Jun 1, 2025
@jobs-git jobs-git changed the title from "[ENH] fix precompute data to massively accelerate training in GPU" to "[ENH] Precompute data to massively accelerate training in GPU" Jun 1, 2025
@fkiraly
Collaborator

fkiraly commented Jun 3, 2025

is this ready for review?

@jobs-git
Author

jobs-git commented Jun 3, 2025

> is this ready for review?

Yes, please. Inputs are welcome.

Collaborator

@fkiraly fkiraly left a comment

Thanks - while we wait for reviews, can you kindly add tests? For TimeSeriesDataSet in isolation, and in integration with the networks?

@jobs-git jobs-git requested a review from fkiraly June 6, 2025 17:50
Returns:
tuple[dict[str, torch.Tensor], torch.Tensor]: x and y for model
"""
if self.precompute:
Collaborator

can you explain why this is correct if self.precompute?

The index idx is not used at all, so this looks wrong.

Author

It is being used inside __item_tensor__, which is just a rename of the original __getitem__; its input parameters are unchanged, including the fact that it accepts idx.

But the mechanism by which it gets idx is different.

Author

When self.precompute=True, __item_tensor__ receives the idx values first, so by the time __getitem__ receives idx it would already be a different set. Using it would change the actual behavior of the original __getitem__, so we do not use it; instead we follow the order from self.precompute_cache, which contains the correct/original order of idx from the first call.

When self.precompute=True, the order in which idx is obtained by __item_tensor__ is as follows:

  1. create the Dataloader by calling to_dataloader
  2. this calls __precompute__
  3. __precompute__ follows TimeSynchronizedBatchSampler to get idx
  4. the idx sequences are passed, in order, to __item_tensor__

Caveat:
So when the Dataloader generates its own idx from its own sampler, that idx is not used. If we used the Dataloader's idx we would NOT get the benefit of this PR, as that idx is generated in real time.

So, instead of retrieving from the cache, the effect would be that we are back to computing the original __getitem__ on each call, which is the slow path all over again, starving the GPU; that is precisely the source of the slowness.

However, if a user wishes to use the Dataloader's idx, the default simply skips precomputation and follows the original slow path, __getitem__ -> __item_tensor__, which is the same as the vanilla __getitem__ anyway.

Collaborator

> When self.precompute=True, __item_tensor__ receives the idx values first, so by the time __getitem__ receives idx it would already be a different set.

Sorry, I still do not get it. Can you explain which dunders are called externally, and which are only used internally? I thought that __getitem__ gets called, so when it does not use idx, the information is not passed on.

Author

@jobs-git jobs-git Jun 7, 2025

TLDR
Does this PR use idx from the Dataloader? YES and NO. If using the defaults (precompute=False), then YES. But if using the accelerated path (precompute=True), then NO.

EXPLANATIONS

  • Vanilla

    Dataloader -> pass idx -> __getitem__() -> use idx
  • PR - Revised vanilla - slow path (precompute=False)

    Dataloader -> pass idx -> __getitem__() -> pass idx ->
    __itemtensor__() -> use idx

    Notes:

    • __getitem__() is revised so it can switch between the vanilla and accelerated paths based on the precompute value; in this case it uses the vanilla path.
    • __itemtensor__() is just the renamed __getitem__(), so this is the usual calculation path when precompute=False

    Verdict:

    • idx from the Dataloader is used, as usual.
  • PR - Accelerated path (precompute=True)

    Two steps happen here:

    • a. Pre-computation routine

      to_dataloader() -> __precompute__() -> generates idx from TSDS ->
      loop over every idx -> pass each idx -> __itemtensor__() -> use idx ->
      save data to cache -> move to the next idx -> repeat

      This is essentially how the Dataloader uses __getitem__(): it calls __getitem__(), passes idx to it, and repeats until all idx values exhaust the data. What I did in this PR is use a for loop to retrieve the idx values in advance and exhaust the data, so we can pre-calculate everything without calling the Dataloader (see the sketch after this list).

    • b. Dataloader call

      Since we have cached the results of __itemtensor__() in advance, meaning __itemtensor__() has already pre-computed the data for each specific idx, we just retrieve them.

      Dataloader -> pass idx -> __getitem__() -> idx ignored -> use cache sequence

      Note:

      • notice that we do not need to call __itemtensor__() this time, since it has already done its job

      Verdict:

      • the Dataloader's idx is no longer essential, since another idx source has been used, specifically TimeSynchronizedBatchSampler
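To make the two paths above concrete, here is a runnable toy sketch. The real PR operates on TimeSeriesDataSet and returns tensors; the class and helper names below (PrecomputeDemo, index_order, _cursor) are illustrative assumptions rather than the PR's actual code:

```python
from typing import List


class PrecomputeDemo:
    """Toy stand-in for the dataset; it only illustrates the control flow above."""

    def __init__(self, precompute: bool = False):
        self.precompute = precompute
        self.precompute_cache: List[int] = []
        self._cursor = 0  # hypothetical position tracker, not necessarily what the PR uses

    def __item_tensor__(self, idx: int) -> int:
        # stands in for the original, expensive per-item pre-processing of __getitem__
        return idx * 10

    def __precompute__(self, index_order: List[int]) -> None:
        # walk the sampler-provided index order once and cache every computed item
        self.precompute_cache = [self.__item_tensor__(i) for i in index_order]
        self._cursor = 0

    def __getitem__(self, idx: int) -> int:
        if self.precompute:
            # accelerated path: serve cached items in the precomputed order and
            # ignore the idx handed over by the DataLoader
            item = self.precompute_cache[self._cursor]
            self._cursor += 1
            return item
        # slow path: identical to the vanilla behaviour
        return self.__item_tensor__(idx)


# precompute=True: to_dataloader() would call __precompute__() with the sampler's order
ds = PrecomputeDemo(precompute=True)
ds.__precompute__([2, 0, 3, 1])   # order fixed by, e.g., TimeSynchronizedBatchSampler
print([ds[i] for i in range(4)])  # -> [20, 0, 30, 10]; the DataLoader's idx is ignored
```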

Collaborator

@jobs-git, apologies, I still do not get it.

> Since we have cached the results of __itemtensor__() in advance, meaning __itemtensor__() has already pre-computed the data for each specific idx, we just retrieve them.

As far as I understand: you are not retrieving the expected result at idx. You are retrieving the results in sequence, irrespective of which idx gets queried. Therefore, this is not the correct logic for __getitem__.

Please let me know if you disagree.

@phoeenniixx, @PranavBhatP, @agobbifbk, @fnhirwa - can you perhaps also have a look? I also do not fully understand the explanation.

Author

@jobs-git jobs-git Jun 7, 2025

For us to be able to accelerate this (precompute=True), we need to query the idx values in advance (rather than waiting for the Dataloader to give them one by one) in order to perform the precomputation. This is what the PR does.

I am confused by what you mean by sequence: the idx values from TimeSynchronizedBatchSampler are still random when shuffle=True; we just saved the results sequentially, so retrieval is also sequential.

As mentioned previously, acceleration requires obtaining the idx values in advance; otherwise, we cannot perform the pre-computation.

Author

@jobs-git jobs-git Jun 7, 2025

@fkiraly I think what you would prefer implemented is obtaining the precomputed values from the real-time idx? That is what I would prefer too, until I saw that the algorithm seems to normalize based on a given idx.

That is when I realized we cannot accelerate this if we wait for the Dataloader, so we need to obtain the idx values in advance through a different mechanism. Hence this PR.

May I ask what the actual concern is if the accelerated path uses a different mechanism to obtain random idx values?

Of course, the exact training trajectory would likely be different, but a properly trained model should converge to the same or a similar outcome. If I presented prediction results, would that settle the concern about the accelerated path using a different random idx generator? Just let me know.

Collaborator

@fkiraly fkiraly left a comment

Thanks - can you explain the logic in the self.precompute branch of __getitem__? It does not look correct to me.


@fkiraly
Collaborator

fkiraly commented Jun 8, 2025

ok, I think I am starting to get it. @jobs-git, would it be possible to produce code for a basic usage example?

Then, for each call of __getitem__ and __itemtensor__, list which idx the different functions are called with, for both the precompute=False and precompute=True cases.

@jobs-git
Author

jobs-git commented Jun 8, 2025

> ok, I think I am starting to get it. @jobs-git, would it be possible to produce code for a basic usage example?

> Then, for each call of __getitem__ and __itemtensor__, list which idx the different functions are called with, for both the precompute=False and precompute=True cases.

Yes, of course, it is possible to add sample code. Regarding the term precompute, is that OK? I think it should have been preprocess, but I noticed preprocess is already in use elsewhere, so I did not use it, to avoid confusion.
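For illustration, here is a rough sketch of the kind of trace being asked for; the index values are made up and only show the shape of the answer:

```python
# toy 4-item dataset; all index sequences below are illustrative
sampler_order = [2, 0, 3, 1]     # e.g. order produced by TimeSynchronizedBatchSampler
dataloader_order = [0, 1, 2, 3]  # e.g. order produced by the DataLoader's own sampler

# precompute=False (slow path): __getitem__(idx) forwards each DataLoader idx to __item_tensor__(idx)
print("slow path:  __item_tensor__ called with", dataloader_order)

# precompute=True (accelerated path): __item_tensor__ is called once per sampler idx during
# __precompute__(); afterwards __getitem__ ignores the DataLoader idx and replays the cache
print("precompute: __item_tensor__ called with", sampler_order)
print("training:   __getitem__ receives", dataloader_order, "but serves cache order", sampler_order)
```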

Labels: enhancement (New feature or request)

Successfully merging this pull request may close these issues:

  • [BUG] Slow model training with GPU
  • [ENH] Improve performance by 2000% by directly batching without using dataloader
  • [BUG] Low GPU Utilization

4 participants